2 months ago

MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation

Ming Chen, Liyuan Cui, Wenyuan Zhang, Haoxian Zhang, Yan Zhou, Xiaohan Li, Xiaoqiang Liu, Pengfei Wan

Abstract

Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building such a practical system that can interact with diverse input signals in real time remains challenging for existing methods, which often struggle with high latency, heavy computational cost, and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with up to 64× reduction ratio, which effectively alleviates the long-horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world model highlight the advantages of our approach in low latency, high efficiency, and fine-grained multimodal controllability.
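
The abstract's point that compression eases long-horizon autoregressive inference can be made concrete with rough arithmetic. The sketch below is not the paper's implementation. It assumes, for illustration only, that the reduction ratio applies to each spatial side of a frame and that every latent position becomes one token in the autoregressive context. The frame size, the frame count, the 8× comparison ratio, and the function names are all hypothetical.

```python
def latent_tokens_per_frame(height: int, width: int, ratio: int) -> int:
    """Latent positions for one frame when each spatial side is divided by `ratio`.

    Assumption (not from the paper): the reduction ratio is a per-side
    spatial downsampling factor and each latent position is one token.
    """
    if height % ratio or width % ratio:
        raise ValueError("frame size must be divisible by the reduction ratio")
    return (height // ratio) * (width // ratio)


def context_length(num_frames: int, height: int, width: int, ratio: int) -> int:
    """Total tokens the autoregressive model would attend over for a clip."""
    return num_frames * latent_tokens_per_frame(height, width, ratio)


if __name__ == "__main__":
    # Hypothetical clip: 100 frames at 512x512 pixels.
    for ratio in (8, 64):
        tokens = context_length(100, 512, 512, ratio)
        print(f"ratio {ratio}x: {tokens} tokens")
```

Under these assumptions, 100 frames at 512×512 occupy 6,400 context positions at a 64× ratio versus 409,600 at 8×, which shows why a higher compression ratio matters for streaming generation over long horizons.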
