
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li
Abstract

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch, and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark, VidChatBench, for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/
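To make the dataset's dual organization concrete, the sketch below shows one plausible way to model a clip record and filter the corpus along the two dimensions the abstract describes (interaction branch and quality tier). All field names, class names, and paths here are hypothetical illustrations, not the released schema or loading API.

```python
# Hypothetical sketch of SpeakerVid-5M's two organizing dimensions:
# interaction branch (dialogue / single / listening / multi-turn) and
# quality tier (large-scale pre-training vs. curated SFT subset).
# Field and class names are assumptions for illustration only.
from dataclasses import dataclass
from enum import Enum
from typing import List


class Branch(Enum):
    DIALOGUE = "dialogue"      # dyadic conversation clips
    SINGLE = "single"          # monadic talking clips
    LISTENING = "listening"    # listening-only clips
    MULTI_TURN = "multi_turn"  # multi-turn conversation clips


class QualityTier(Enum):
    PRETRAIN = "pretrain"      # large-scale pre-training subset
    SFT = "sft"                # curated, high-quality SFT subset


@dataclass
class ClipMeta:
    clip_id: str
    branch: Branch
    tier: QualityTier
    duration_s: float
    video_path: str
    audio_path: str


def select_clips(clips: List[ClipMeta],
                 branch: Branch,
                 tier: QualityTier) -> List[ClipMeta]:
    """Filter the corpus along the two dimensions (branch x quality tier)."""
    return [c for c in clips if c.branch == branch and c.tier == tier]


if __name__ == "__main__":
    # Toy corpus: pick curated dyadic-conversation clips, e.g. to fine-tune
    # a video chat model on the SFT split.
    corpus = [
        ClipMeta("clip_000001", Branch.DIALOGUE, QualityTier.SFT, 12.4,
                 "videos/clip_000001.mp4", "audio/clip_000001.wav"),
        ClipMeta("clip_000002", Branch.LISTENING, QualityTier.PRETRAIN, 8.1,
                 "videos/clip_000002.mp4", "audio/clip_000002.wav"),
    ]
    sft_dialogue = select_clips(corpus, Branch.DIALOGUE, QualityTier.SFT)
    print(f"{len(sft_dialogue)} SFT dialogue clip(s) selected")
```

Under this assumed layout, a pre-training run would draw from the PRETRAIN tier across all four branches, while the curated SFT tier would be reserved for fine-tuning and for building benchmark splits such as VidChatBench.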