EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion-captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the fidelity and diversity of the results. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available at https://pantomatrix.github.io/EMAGE/
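
The abstract describes the Masked Audio Gesture Transformer only at a high level. The snippet below is a minimal, hypothetical PyTorch sketch of the core idea as stated: a shared transformer is trained jointly on masked gesture reconstruction and audio-to-gesture generation, so that at inference an all-masked gesture sequence yields generation driven purely by audio. All module names, dimensions, and the fusion-by-addition choice are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of joint masked-gesture reconstruction and
# audio-to-gesture prediction; shapes and modules are assumptions.
import torch
import torch.nn as nn

class MaskedAudioGestureTransformer(nn.Module):
    def __init__(self, audio_dim=768, gesture_dim=256, d_model=512, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.gesture_proj = nn.Linear(gesture_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.gesture_head = nn.Linear(d_model, gesture_dim)

    def forward(self, audio_feat, gesture_feat, mask):
        # audio_feat: (B, T, audio_dim); gesture_feat: (B, T, gesture_dim)
        # mask: (B, T) boolean, True where gesture frames are hidden.
        g = self.gesture_proj(gesture_feat)
        g = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(g), g)
        a = self.audio_proj(audio_feat)
        # Fuse audio with the partially masked gesture tokens, then decode.
        h = self.encoder(a + g)
        return self.gesture_head(h)

# Training step: reconstruct the masked frames; at inference, masking all
# frames reduces the task to audio-to-gesture generation.
model = MaskedAudioGestureTransformer()
audio = torch.randn(2, 64, 768)
gesture = torch.randn(2, 64, 256)
mask = torch.rand(2, 64) < 0.5
pred = model(audio, gesture, mask)
loss = ((pred - gesture) ** 2)[mask].mean()
loss.backward()
```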