FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion

Speech-driven 3D facial animation synthesis has been a challenging task both in industry and research. Recent methods mostly focus on deterministic deep learning methods, meaning that given a speech input, the output is always the same. However, in reality, the non-verbal facial cues that reside throughout the face are non-deterministic in nature. In addition, the majority of approaches focus on 3D vertex-based datasets, and methods that are compatible with existing facial animation pipelines using rigged characters are scarce. To address these issues, we present FaceDiffuser, a non-deterministic deep learning model for generating speech-driven facial animations that is trained with both 3D vertex- and blendshape-based datasets. Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input. To the best of our knowledge, we are the first to employ the diffusion method for the task of speech-driven 3D facial animation synthesis. We have run extensive objective and subjective analyses and show that our approach achieves better or comparable results compared to state-of-the-art methods. We also introduce a new in-house dataset based on a blendshape-based rigged character. We recommend watching the accompanying supplementary video. The code and the dataset will be publicly available.
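To illustrate the audio encoding step mentioned above, the following is a minimal sketch of how a pre-trained HuBERT model could be used to extract frame-level speech representations via the HuggingFace transformers library. The checkpoint name, input file, and pre-processing details are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, HubertModel

# Load a pre-trained HuBERT model and its feature extractor.
# "facebook/hubert-base-ls960" is an illustrative checkpoint; the paper
# does not specify which HuBERT variant is used.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
hubert.eval()

# Load an example speech clip (hypothetical path) and resample to the
# 16 kHz sampling rate expected by HuBERT.
waveform, sample_rate = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Encode the audio into a sequence of frame-level speech representations,
# which a downstream animation model could condition on.
inputs = feature_extractor(waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    speech_features = hubert(**inputs).last_hidden_state  # shape: (1, T, 768)

print(speech_features.shape)
```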