MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation
Ming Chen, Liyuan Cui, Wenyuan Zhang, Haoxian Zhang, Yan Zhou, Xiaohan Li, Xiaoqiang Liu, Pengfei Wan

Abstract
Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building such a practical system that can interact with diverse input signals in real time remains challenging for existing methods, which often struggle with high latency, heavy computational cost, and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with up to a 64× reduction ratio, which effectively alleviates the long-horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world model highlight the advantages of our approach in low latency, high efficiency, and fine-grained multimodal controllability.
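As a rough illustration of the pipeline the abstract describes, the sketch below shows how per-modality condition encodings (audio, pose, text) could be projected into a shared token space, processed by a causal transformer backbone, and used to condition a small diffusion head that predicts the noise on the next frame's compressed latent. This is a minimal sketch under assumed dimensions and module names (ConditionedARBackbone, the projection layers, the plain token concatenation), not the authors' implementation.

```python
# Hypothetical sketch: multimodal condition tokens drive a causal transformer,
# whose last hidden state conditions a diffusion head over next-frame latents.
import torch
import torch.nn as nn


class ConditionedARBackbone(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=6,
                 audio_dim=128, pose_dim=64, text_dim=256, latent_dim=64):
        super().__init__()
        # Per-modality projections into a shared token space (dims are illustrative).
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.pose_proj = nn.Linear(pose_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        self.latent_proj = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Diffusion head: predicts the noise added to the next frame's latent,
        # conditioned on the backbone state and a diffusion timestep embedding.
        self.t_embed = nn.Embedding(1000, d_model)
        self.denoise = nn.Sequential(
            nn.Linear(d_model + latent_dim, d_model), nn.SiLU(),
            nn.Linear(d_model, latent_dim),
        )

    def forward(self, audio, pose, text, past_latents, noisy_next, t):
        # Simplified conditioning: concatenate condition tokens and past frame
        # latents along the sequence axis, then apply a causal attention mask.
        tokens = torch.cat([
            self.text_proj(text), self.audio_proj(audio),
            self.pose_proj(pose), self.latent_proj(past_latents),
        ], dim=1)
        seq_len = tokens.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device),
            diagonal=1,
        )
        h = self.backbone(tokens, mask=causal)           # (B, S, d_model)
        ctx = h[:, -1] + self.t_embed(t)                  # last state guides next frame
        return self.denoise(torch.cat([ctx, noisy_next], dim=-1))  # predicted noise
```

In this reading, the autoregressive backbone handles streaming multimodal context while the lightweight head performs the per-frame denoising, which is one way the low-latency, streaming behavior described above could be realized.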