FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning

This paper presents FaceXHuBERT, a text-less speech-driven 3D facial animation generation method that captures personalized and subtle cues in speech (e.g. identity, emotion and hesitation). It is also very robust to background noise and can handle audio recorded in a variety of situations (e.g. multiple people speaking). Recent approaches employ end-to-end deep learning, taking both audio and text as input to generate facial animation for the whole face. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck. The resulting animations still have issues regarding accurate lip-synching, expressivity, person-specific information and generalizability. We effectively employ a self-supervised pretrained HuBERT model in the training process, which allows us to incorporate both lexical and non-lexical information in the audio without using a large lexicon. Additionally, guiding the training with a binary emotion condition and speaker identity helps distinguish even the subtlest facial motions. We carried out extensive objective and subjective evaluations in comparison to ground truth and state-of-the-art work. A perceptual user study demonstrates that our approach produces superior results with respect to the realism of the animation 78% of the time compared to the state of the art. In addition, our method is 4 times faster by eliminating the use of complex sequential models such as transformers. We strongly recommend watching the supplementary video before reading the paper. We also provide the implementation and evaluation code via a GitHub repository link.
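To illustrate the kind of audio conditioning described above, the following is a minimal sketch (not the authors' implementation) of extracting frame-level speech representations with a pretrained HuBERT model and concatenating a binary emotion flag and a one-hot speaker identity. The checkpoint name `facebook/hubert-base-ls960`, the 8-speaker one-hot vector, the file name and the concatenation scheme are illustrative assumptions only.

```python
import torch
import torchaudio
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# Load a pretrained HuBERT checkpoint (illustrative choice; the paper's exact
# checkpoint may differ).
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
hubert.eval()

# Load a waveform, resample to 16 kHz and mix down to mono.
waveform, sr = torchaudio.load("speech.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

# Frame-level HuBERT features (~50 Hz for 16 kHz audio), shape (1, T, 768).
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_features = hubert(inputs.input_values).last_hidden_state

# Illustrative conditioning: a binary emotion flag (1 = expressive, 0 = neutral)
# and a one-hot speaker identity, broadcast over time and concatenated to the
# speech features before they are fed to an animation decoder.
T = audio_features.size(1)
emotion = torch.ones(1, T, 1)
speaker = torch.zeros(1, T, 8)
speaker[..., 0] = 1.0
conditioned = torch.cat([audio_features, emotion, speaker], dim=-1)
```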