Generating Holistic 3D Human Motion from Speech

Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, Michael J. Black
Abstract

This work addresses the problem of generating 3D holistic body motions from human speech. Given a speech recording, we synthesize sequences of 3D body poses, hand gestures, and facial expressions that are realistic and diverse. To achieve this, we first build a high-quality dataset of 3D holistic body meshes with synchronous speech. We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately. The separated modeling stems from the fact that face articulation strongly correlates with human speech, while body poses and hand gestures are less correlated. Specifically, we employ an autoencoder for face motions, and a compositional vector-quantized variational autoencoder (VQ-VAE) for the body and hand motions. The compositional VQ-VAE is key to generating diverse results. Additionally, we propose a cross-conditional autoregressive model that generates body poses and hand gestures, leading to coherent and realistic motions. Extensive experiments and user studies demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively. Our novel dataset and code will be released for research purposes at https://talkshow.is.tue.mpg.de.
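At the heart of the VQ-VAE mentioned in the abstract is vector quantization: each continuous latent produced by the encoder is snapped to its nearest entry in a learned codebook, giving a discrete motion token per time step. The sketch below illustrates only this generic quantization step; the function and variable names are illustrative and not taken from the paper's released code, and the toy codebook values are made up for demonstration.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (the VQ step).

    latents:  (T, D) array of encoder outputs over T time steps.
    codebook: (K, D) array of K learned code vectors.
    Returns the quantized latents and the chosen code indices.
    """
    # Squared Euclidean distance from every latent to every code,
    # computed via broadcasting: result has shape (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)  # nearest code per time step
    return codebook[indices], indices

# Toy example: 4 time steps of 2-D latents, a codebook of 3 entries.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.4], [0.05, 0.0]])
quantized, idx = quantize(latents, codebook)
# idx → [0, 1, 2, 0]: each latent maps to its closest code.
```

Once motions are tokenized this way, an autoregressive model can predict the next body/hand token conditioned on speech features and previously generated tokens, which is the role the cross-conditional autoregressive model plays in the paper.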
