dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Yumak, Z.
dc.contributor.author: Stan, Stefan
dc.date.accessioned: 2024-11-01T01:01:52Z
dc.date.available: 2024-11-01T01:01:52Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/48066
dc.description.abstract: Speech-driven facial animation synthesis has been a notable area of research in recent years, with new state-of-the-art approaches constantly emerging. Deep-learning techniques have demonstrated remarkable results for this task, clearly outperforming procedural methods. However, we notice a relative scarcity of methods for generating rigged character facial animations that can be seamlessly integrated into animation pipelines, as well as a lack of non-deterministic methods that can produce a wide variety of animations. In this paper, we present FaceDiffuser, a deep-learning model that generates expressive and diverse facial animation sequences from speech audio input. To the best of our knowledge, we are the first to employ the diffusion mechanism for the task of 3D facial animation synthesis, leveraging its non-deterministic nature for more expressive facial animation generation. We use a pre-trained large speech representation model, HuBERT, to extract speech features, as it has proven effective even on noisy audio, making our model more robust to noisy settings. We show that our model is robust to noisy audio and can be used to animate 3D vertex facial meshes as well as rigged characters. We utilise 4D facial scan datasets as well as datasets containing rigged character animations, such as our in-house dataset UUDaMM, along with the recently released blendshape-based BEAT dataset. The results are assessed using both objective and subjective metrics, in comparison to state-of-the-art methods as well as the ground truth data. We show that our model performs objectively better than state-of-the-art techniques, producing lower lip vertex error than the competitors. In terms of the qualitative evaluation, we show by means of a user study that our model clearly outperforms one of the state-of-the-art methods, while being rated similarly to or slightly worse than the other two competitors. Furthermore, an ablation of the diffusion component shows that the full model performs better than a variant without diffusion, strengthening our intuition about the benefits of the diffusion process.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: In this project, we investigate how the facial animation of a character can be generated from audio input. We employ deep-learning algorithms and are the first to use diffusion models for the task of generating 3D facial animation. We use multiple datasets and successfully generate facial animations both as vertex displacements and as rig control values or blendshape weights. Our animation results are expressive and non-deterministic, and objectively outperform the state of the art.
dc.title: FaceDiffuser: Speech-Driven Facial Animation Synthesis Using Diffusion
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: facial animation synthesis; deep learning; virtual humans; mesh animation; blendshape animation
dc.subject.courseuu: Game and Media Technology
dc.thesis.id: 21361
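
The abstract above describes a diffusion model, conditioned on speech features from a pre-trained HuBERT encoder, that generates per-frame facial animation as vertex displacements or blendshape weights. The following is a minimal, self-contained sketch of that general idea, not the authors' FaceDiffuser implementation: the network architecture, feature and blendshape dimensions, number of diffusion steps, and noise schedule are all illustrative assumptions, and the HuBERT features are replaced by a random stand-in tensor.

# Minimal sketch (not the authors' code): a denoising-diffusion decoder that maps
# speech features (e.g. HuBERT embeddings) to per-frame blendshape weights.
# All module names, sizes, and the denoiser architecture are illustrative assumptions.
import torch
import torch.nn as nn

T_FRAMES, AUDIO_DIM, BLEND_DIM, STEPS = 120, 768, 52, 50  # assumed dimensions

class Denoiser(nn.Module):
    """Predicts the noise added to a blendshape sequence, conditioned on audio."""
    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(AUDIO_DIM, 128)
        self.step_embed = nn.Embedding(STEPS, 128)
        self.gru = nn.GRU(BLEND_DIM + 128, 128, batch_first=True)
        self.out = nn.Linear(128, BLEND_DIM)

    def forward(self, noisy_blend, audio_feat, t):
        # Condition each frame on the projected audio feature plus a timestep embedding.
        cond = self.audio_proj(audio_feat) + self.step_embed(t)[:, None, :]
        h, _ = self.gru(torch.cat([noisy_blend, cond], dim=-1))
        return self.out(h)

# Linear noise schedule and DDPM-style ancestral sampling.
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, audio_feat):
    """Start from Gaussian noise and iteratively denoise into an animation sequence."""
    x = torch.randn(audio_feat.size(0), T_FRAMES, BLEND_DIM)
    for t in reversed(range(STEPS)):
        t_batch = torch.full((audio_feat.size(0),), t, dtype=torch.long)
        eps = model(x, audio_feat, t_batch)
        # Posterior mean of x_{t-1} given the predicted noise.
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # inject sampling noise
    return x  # (batch, frames, blendshape weights)

# Usage with stand-in audio features (in the thesis these come from pre-trained HuBERT).
audio = torch.randn(1, T_FRAMES, AUDIO_DIM)
animation = sample(Denoiser(), audio)
print(animation.shape)  # torch.Size([1, 120, 52])

Because sampling starts from fresh Gaussian noise each time, repeated calls with the same audio yield different but plausible animations, which is the non-deterministic behaviour the abstract highlights.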

