dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Yumak, Z.
dc.contributor.author: Stan, Stefan
dc.date.accessioned: 2024-11-01T01:01:52Z
dc.date.available: 2024-11-01T01:01:52Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/48066
dc.description.abstract: Speech-driven facial animation synthesis has been a notable area of research in recent years, with new state-of-the-art approaches constantly emerging. Deep-learning techniques have demonstrated remarkable results for this task, clearly outperforming procedural methods. However, we notice a relative scarcity of methods for generating rigged character facial animations that can be seamlessly integrated into animation pipelines, as well as a lack of non-deterministic methods that can produce a wide variety of animations. In this paper, we present FaceDiffuser, a deep-learning model that generates expressive and diverse facial animation sequences from speech audio input. To the best of our knowledge, we are the first to employ the diffusion mechanism for the task of 3D facial animation synthesis, leveraging its non-deterministic nature for more expressive facial animation generation. We use a pre-trained large speech representation model, HuBERT, to extract speech features, as it has proven effective even on noisy audio, making our model more robust to noisy settings. We show that our model is robust to noisy audio and can be used to animate 3D vertex facial meshes as well as rigged characters. We utilise 4D facial scan datasets as well as datasets containing rigged character animations, such as our in-house dataset UUDaMM, along with the recently released blendshape-based BEAT dataset. The results are assessed using both objective and subjective metrics, in comparison to state-of-the-art methods as well as the ground truth data. We show that our model performs objectively better than state-of-the-art techniques, producing lower lip vertex error than the competitors. In terms of the qualitative evaluation, we show by means of a user study that our model clearly outperforms one of the state-of-the-art methods, while being rated similarly to or slightly worse than the other two competitors. Furthermore, an ablation of the diffusion component shows that the full model performs better than a variant without diffusion, strengthening our intuition about the benefits of the diffusion process.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: In this project, we investigate how the facial animation of a character can be generated from audio input. We employ deep-learning algorithms and are the first to use diffusion models for the task of generating 3D facial animation. We use multiple datasets and successfully generate facial animations both as vertex displacements and as rig control values or blendshape weights. Our animation results are expressive and non-deterministic, and objectively outperform the state of the art.
dc.title: FaceDiffuser: Speech-Driven Facial Animation Synthesis Using Diffusion
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: facial animation synthesis; deep learning; virtual humans; mesh animation; blendshape animation
dc.subject.courseuu: Game and Media Technology
dc.thesis.id: 21361
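
The abstract above describes a diffusion model, conditioned on speech features from a pre-trained HuBERT encoder, that generates per-frame facial animation as vertex displacements or blendshape weights. The following is a minimal, self-contained sketch of that general idea, not the authors' FaceDiffuser implementation: the network architecture, feature and blendshape dimensions, number of diffusion steps, and noise schedule are all illustrative assumptions, and the HuBERT features are replaced by a random stand-in tensor.

# Minimal sketch (not the authors' code): a denoising-diffusion decoder that maps
# speech features (e.g. HuBERT embeddings) to per-frame blendshape weights.
# All module names, sizes, and the denoiser architecture are illustrative assumptions.
import torch
import torch.nn as nn

T_FRAMES, AUDIO_DIM, BLEND_DIM, STEPS = 120, 768, 52, 50  # assumed dimensions

class Denoiser(nn.Module):
    """Predicts the noise added to a blendshape sequence, conditioned on audio."""
    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(AUDIO_DIM, 128)
        self.step_embed = nn.Embedding(STEPS, 128)
        self.gru = nn.GRU(BLEND_DIM + 128, 128, batch_first=True)
        self.out = nn.Linear(128, BLEND_DIM)

    def forward(self, noisy_blend, audio_feat, t):
        # Condition each frame on the projected audio feature plus a timestep embedding.
        cond = self.audio_proj(audio_feat) + self.step_embed(t)[:, None, :]
        h, _ = self.gru(torch.cat([noisy_blend, cond], dim=-1))
        return self.out(h)

# Linear noise schedule and DDPM-style ancestral sampling.
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, audio_feat):
    """Start from Gaussian noise and iteratively denoise into an animation sequence."""
    x = torch.randn(audio_feat.size(0), T_FRAMES, BLEND_DIM)
    for t in reversed(range(STEPS)):
        t_batch = torch.full((audio_feat.size(0),), t, dtype=torch.long)
        eps = model(x, audio_feat, t_batch)
        # Posterior mean of x_{t-1} given the predicted noise.
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # inject sampling noise
    return x  # (batch, frames, blendshape weights)

# Usage with stand-in audio features (in the thesis these come from pre-trained HuBERT).
audio = torch.randn(1, T_FRAMES, AUDIO_DIM)
animation = sample(Denoiser(), audio)
print(animation.shape)  # torch.Size([1, 120, 52])

Because sampling starts from fresh Gaussian noise each time, repeated calls with the same audio yield different but plausible animations, which is the non-deterministic behaviour the abstract highlights.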

