MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models

Gesture synthesis is a vital area of human-computer interaction, with wide-ranging applications across fields such as film, robotics, and virtual reality. Recent advances have leveraged diffusion models and attention mechanisms to improve gesture synthesis. However, the high computational complexity of these techniques makes generating long, diverse sequences with low latency challenging. We explore the potential of state space models (SSMs) to address this challenge, implementing a two-stage modeling strategy with discrete motion priors to enhance gesture quality. Building on the foundational Mamba block, we introduce MambaTalk, which enhances gesture diversity and rhythm through multimodal integration. Extensive experiments demonstrate that our method matches or exceeds the performance of state-of-the-art models.
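To make the core mechanism concrete, the sketch below shows a minimal selective state-space scan of the kind underlying Mamba blocks: the step size, input matrix B, and output matrix C are all made input-dependent (the "selective" part), while a diagonal negative state matrix A keeps the recurrence stable. All names, shapes, and projections here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softplus(z):
    # Smooth positive mapping used for the input-dependent step size.
    return np.log1p(np.exp(z))

def selective_scan(x, A, W_B, W_C, w_dt):
    """Sequential selective-SSM scan (illustrative, not MambaTalk's code).

    x:    (T, D) input sequence
    A:    (D, N) negative diagonal state-decay parameters
    W_B:  (D, N) projection making B input-dependent
    W_C:  (D, N) projection making C input-dependent
    w_dt: (D,)   per-channel timescale weights
    Returns y of shape (T, D).
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                      # per-channel hidden state
    y = np.zeros((T, D))
    for t in range(T):
        u = x[t]                              # (D,)
        dt = softplus(u * w_dt)               # input-dependent step size (D,)
        B_t = u @ W_B                         # input-dependent B (N,)
        C_t = u @ W_C                         # input-dependent C (N,)
        A_bar = np.exp(dt[:, None] * A)       # discretized decay (D, N)
        h = A_bar * h + dt[:, None] * (u[:, None] * B_t[None, :])
        y[t] = h @ C_t                        # readout (D,)
    return y

rng = np.random.default_rng(0)
T, D, N = 16, 4, 8
x = rng.standard_normal((T, D))
A = -softplus(rng.standard_normal((D, N)))    # A < 0 keeps the decay stable
y = selective_scan(x, A,
                   0.1 * rng.standard_normal((D, N)),
                   0.1 * rng.standard_normal((D, N)),
                   0.1 * rng.standard_normal(D))
print(y.shape)  # (16, 4)
```

Because the recurrence is linear in the state, such scans can be parallelized over the sequence in practice, which is what gives SSMs their low-latency advantage over attention for long sequences.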