MOSPA: Human Motion Generation Driven by Spatial Audio

Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generated human motion. To date, these models typically overlook the impact of the spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our model and dataset will be open-sourced upon acceptance. Please refer to our supplementary video for more details.
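To make the abstract's description of spatial-audio conditioning concrete, the following is a minimal conceptual sketch of a diffusion denoiser that fuses a noisy motion sequence with per-frame spatial audio features. This is not the authors' implementation: the module names, feature dimensions, transformer backbone, and the additive fusion are all assumptions made for illustration.

```python
# Minimal sketch of spatial-audio-conditioned motion denoising.
# Assumptions (not from the paper): feature sizes, additive fusion,
# transformer encoder backbone, timestep embedding via an MLP.
import torch
import torch.nn as nn

class SpatialAudioConditionedDenoiser(nn.Module):
    def __init__(self, motion_dim=263, audio_dim=128, latent_dim=256):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, latent_dim)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)   # spatial audio features per frame
        self.time_embed = nn.Sequential(
            nn.Linear(1, latent_dim), nn.SiLU(), nn.Linear(latent_dim, latent_dim)
        )
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(latent_dim, motion_dim)

    def forward(self, noisy_motion, audio_feats, t):
        # noisy_motion: (B, T, motion_dim); audio_feats: (B, T, audio_dim); t: (B,)
        h = self.motion_proj(noisy_motion) + self.audio_proj(audio_feats)  # fuse motion and audio
        h = h + self.time_embed(t.float().unsqueeze(-1)).unsqueeze(1)      # add diffusion timestep
        return self.out(self.backbone(h))                                  # predict denoised motion
```

At sampling time, such a denoiser would be applied iteratively over the diffusion schedule, with the same spatial audio features supplied at every step so the generated motion stays aligned with the sound source's spatial cues.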