2 months ago

Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Abstract

Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
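
The second stage described in the abstract (sub-clips generated in parallel, each bounded by a first and a last blueprint keyframe) can be illustrated with a small scheduling sketch. This is not the authors' implementation: `keyframe_pairs`, `generate_subclip`, and `synthesize` are hypothetical names, frames are stood in for by strings, and the generator is a placeholder that only labels in-between frames. The sketch assumes adjacent sub-clips share a boundary keyframe, which the abstract does not state explicitly.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Frame = str  # placeholder type; a real system would use image tensors


def keyframe_pairs(keyframes: Sequence[Frame]) -> List[Tuple[Frame, Frame]]:
    """Pair each keyframe with its successor: one (first, last) pair per sub-clip."""
    if len(keyframes) < 2:
        raise ValueError("need at least two keyframes to bound a sub-clip")
    return list(zip(keyframes[:-1], keyframes[1:]))


def generate_subclip(first: Frame, last: Frame, n_frames: int = 4) -> List[Frame]:
    """Hypothetical stand-in for the video generator: labels the in-between frames."""
    middle = [f"{first}->{last}#{i}" for i in range(1, n_frames - 1)]
    return [first, *middle, last]


def synthesize(
    keyframes: Sequence[Frame],
    generate: Callable[[Frame, Frame], List[Frame]] = generate_subclip,
    max_workers: int = 4,
) -> List[Frame]:
    """Generate all sub-clips concurrently, then stitch them in keyframe order."""
    pairs = keyframe_pairs(keyframes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so clips line up with the keyframes.
        clips = list(pool.map(lambda pair: generate(*pair), pairs))
    video = list(clips[0])
    for clip in clips[1:]:
        video.extend(clip[1:])  # skip the boundary keyframe shared with the previous clip
    return video


if __name__ == "__main__":
    for frame in synthesize(["k0", "k1", "k2"]):
        print(frame)
```

Because each sub-clip depends only on its two bounding keyframes, the clips are independent of one another, which is what allows them to be produced concurrently and why generation time need not grow linearly with video length.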
