GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling

Generating full-body human gestures from speech signals remains challenging in terms of both quality and speed. Existing approaches model different body regions, such as the body, legs, and hands, separately, failing to capture the spatial interactions between them and producing unnatural, disjointed movements. Additionally, their autoregressive or diffusion-based pipelines generate slowly because they require dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for co-speech gesture generation with spatial-temporal modeling. Our method (i) explicitly models the interactions among tokenized body regions through spatial and temporal attention to generate coherent full-body gestures, and (ii) introduces flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of the flow-matching baseline, we propose latent shortcut learning and beta-distribution timestep sampling during training, which enhance gesture synthesis quality and accelerate inference. Combining spatial-temporal modeling with the improved flow-matching framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for digital humans and embodied agents in real-world applications. Project page: https://andypinxinliu.github.io/GestureLSM
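To make the abstract's main ingredients concrete, the sketch below shows how a flow-matching velocity loss, Beta-distributed timestep sampling, and a shortcut self-consistency objective could be combined in a single PyTorch training step. It is a minimal illustration of these general techniques, not the paper's implementation: `model`, `cond`, the Beta parameters `alpha`/`beta`, and the step size `d` are hypothetical placeholders.

```python
# Illustrative sketch only: flow matching + Beta timestep sampling +
# shortcut self-consistency in one simplified training step.
import torch
import torch.nn.functional as F

def training_step(model, x1, cond, alpha=2.0, beta=1.0):
    """One training step on a batch of gesture latents `x1`.

    Assumes model(x_t, t, d, cond) predicts a velocity, conditioned on
    the timestep t and a shortcut step size d (placeholder signature).
    """
    B = x1.shape[0]
    x0 = torch.randn_like(x1)                      # noise endpoint

    # Beta-distributed timesteps bias training toward one end of [0, 1]
    # (alpha, beta are assumed hyperparameters, not the paper's values).
    t = torch.distributions.Beta(alpha, beta).sample((B,)).to(x1.device)
    t_ = t.view(B, *([1] * (x1.dim() - 1)))

    # Linear interpolation path and its constant ground-truth velocity.
    xt = (1 - t_) * x0 + t_ * x1
    v_target = x1 - x0

    # Flow-matching velocity loss at the smallest step size (d = 0).
    d0 = torch.zeros(B, device=x1.device)
    loss_fm = F.mse_loss(model(xt, t, d0, cond), v_target)

    # Shortcut self-consistency: one step of size 2d should match two
    # consecutive steps of size d (targets held fixed via no_grad;
    # boundary handling for t + d > 1 omitted for brevity).
    d = torch.full((B,), 1 / 8, device=x1.device)  # example step size
    with torch.no_grad():
        v1 = model(xt, t, d, cond)
        xt_next = xt + d.view_as(t_) * v1
        v2 = model(xt_next, t + d, d, cond)
        v_sc = (v1 + v2) / 2
    loss_sc = F.mse_loss(model(xt, t, 2 * d, cond), v_sc)

    return loss_fm + loss_sc
```

At inference, the same step-size conditioning would allow sampling with a handful of large steps (e.g., two steps of d = 1/2), which is where the speedup over dozens-of-step autoregressive or diffusion pipelines comes from.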