
Learning Speech-driven 3D Conversational Gestures from Video

Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter Seidel, Gerard Pons-Moll, Mohamed Elgharib, Christian Theobalt
Abstract

We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people. To this end, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation as well as dense 3D face performance capture to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms that resort to complex in-studio motion capture solutions, and thereby train more expressive synthesis algorithms. Our experiments and user study show the state-of-the-art quality of our speech-synthesized full 3D character animations.
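To make the audio-conditioned GAN idea concrete, the following is a minimal sketch (not the authors' code) of a generator that maps audio features to 3D pose sequences and a discriminator that scores the plausibility of an (audio, motion) pairing. All module names, feature sizes, and sequence lengths here are hypothetical placeholders; the paper's actual architecture differs in detail.

```python
# Minimal illustrative sketch of an audio-conditioned GAN for gesture
# synthesis. Feature dimensions and layer choices are assumptions made
# for illustration only, not the paper's actual configuration.
import torch
import torch.nn as nn

AUDIO_DIM = 64    # assumed per-frame audio feature size (e.g. MFCC-like)
POSE_DIM = 63     # assumed 3D body pose size (e.g. 21 joints x 3 coords)

class Generator(nn.Module):
    """Maps a window of audio features to a window of 3D poses."""
    def __init__(self):
        super().__init__()
        # 1D temporal convolutions over the audio sequence, mirroring
        # the CNN-style architecture the abstract mentions.
        self.net = nn.Sequential(
            nn.Conv1d(AUDIO_DIM, 256, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, POSE_DIM, kernel_size=5, padding=2),
        )

    def forward(self, audio):               # audio: (B, T, AUDIO_DIM)
        x = audio.transpose(1, 2)            # -> (B, AUDIO_DIM, T)
        return self.net(x).transpose(1, 2)   # -> (B, T, POSE_DIM)

class Discriminator(nn.Module):
    """Scores how plausible a pose sequence is *given* the paired audio."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(AUDIO_DIM + POSE_DIM, 256, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, 256, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(256, 1),   # real/fake logit for the (audio, motion) pair
        )

    def forward(self, audio, poses):         # both: (B, T, feat)
        x = torch.cat([audio, poses], dim=-1).transpose(1, 2)
        return self.net(x)

# One adversarial step: the discriminator judges the pairing of motion
# with audio rather than matching a single regression target, so many
# different gestures can count as "real" for the same speech input.
G, D = Generator(), Discriminator()
audio = torch.randn(8, 64, AUDIO_DIM)       # batch of audio windows
real_poses = torch.randn(8, 64, POSE_DIM)   # poses captured from video
fake_poses = G(audio)
bce = nn.BCEWithLogitsLoss()
d_loss = bce(D(audio, real_poses), torch.ones(8, 1)) + \
         bce(D(audio, fake_poses.detach()), torch.zeros(8, 1))
g_loss = bce(D(audio, fake_poses), torch.ones(8, 1))
```

The key design choice illustrated here is conditioning the discriminator on the audio: because the loss measures the plausibility of the pairing rather than distance to one ground-truth sequence, the model can handle the multi-modal nature of the task, where several distinct gestures plausibly accompany the same speech.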
