KlingAvatar 2.0 Technical Report
Abstract
Avatar video generation models have made considerable progress in recent years. However, prior work shows limited efficiency when producing long, high-resolution videos and suffers from temporal drift, quality degradation, and weak instruction following, especially as video length increases. To address these challenges, we present KlingAvatar 2.0, a spatio-temporal cascade framework that scales in both spatial resolution and the temporal dimension. The framework first generates low-resolution blueprint keyframes that capture global semantics and motion, and then refines them via a first-last-frame strategy into high-resolution, temporally coherent sub-clips while preserving smooth temporal transitions in long videos. To improve cross-modal fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts analyze the priorities of each modality and infer the underlying user intent, converting the inputs into detailed storylines through multiple rounds of dialogue. An additional Negative Director refines negative prompts to further strengthen instruction following. Building on these components, we extend the framework to support ID-specific control of multiple characters. Extensive experiments show that our model effectively addresses the challenges of efficient, multimodally aligned generation of long, high-resolution videos, achieving improved visual clarity, realistic lip and teeth rendering with precise lip synchronization, strong identity preservation, and coherent execution of multimodal instructions.
One-sentence Summary
The Kling Team at Kuaishou Technology proposes KlingAvatar 2.0, a spatio-temporal cascade framework with a Co-Reasoning Director and negative prompt refinement for efficient, long-form, high-resolution audio-driven avatar video generation, enabling identity-preserving, multimodally aligned, multi-character control with enhanced visual fidelity, lip synchronization, and instruction following.
Key Contributions
- We propose a spatial-temporal cascade framework that generates long-duration, high-resolution avatar videos efficiently by first producing low-resolution blueprint keyframes and then progressively refining them into detailed, temporally coherent sub-clips, effectively reducing temporal drifting and visual degradation while maintaining smooth transitions.
- We introduce a Co-Reasoning Director composed of modality-specific LLM experts that engage in multi-turn dialogue to infer user intent, resolve modality conflicts, and generate coherent shot-level storylines, while a negative director enhances instruction alignment by refining fine-grained negative prompts for improved semantic accuracy.
- Our framework supports ID-specific multi-character control via mask-controlled audio injection using deep DiT block features and ID-aware attention, enabling synchronized, individualized animations in complex conversational scenarios, with extensive evaluation showing superior performance in visual quality, lip synchronization, identity preservation, and multimodal instruction following on a large-scale cinematic dataset.
Introduction
Audio-driven avatar video generation enables lifelike, expressive digital humans with synchronized facial expressions, lip movements, and body gestures, with applications in education, entertainment, and virtual assistants. While prior methods have advanced from basic lip-sync to full-body animation using diffusion models, they struggle with long-duration, high-resolution synthesis due to temporal drifting, visual degradation, and poor alignment with complex multimodal instructions. Existing approaches often fail to maintain coherence across extended sequences or handle multi-character interactions with individual audio control. The authors introduce KlingAvatar 2.0, a spatio-temporal cascade framework that first generates low-resolution blueprint keyframes for global motion and semantics, then refines them into high-resolution, temporally coherent sub-clips using a first-last frame conditioning strategy. To improve instruction adherence, they propose a Co-Reasoning Director—a multi-turn dialogue system with modality-specific LLM experts that resolve conflicts and generate detailed storylines, complemented by a negative director that enhances fine-grained prompt refinement. The framework further enables identity-specific multi-character control via mask-aware audio injection using deep DiT features. Together, these innovations enable efficient, high-fidelity, long-form video generation with strong identity preservation, accurate lip-speech synchronization, and robust multimodal alignment.
Method
The authors leverage a spatial-temporal cascade diffusion framework to enable high-fidelity, long-form digital human video generation with accurate lip synchronization and fine-grained control over multiple speakers. This framework operates through a hierarchical pipeline that integrates global planning with local refinement, as illustrated in the overall system diagram. The process begins with multimodal inputs—reference images, audio, and textual instructions—fed into a Co-Reasoning Multimodal Large Language Model (MLLM) Director. This director orchestrates a multi-turn dialogue among three specialized experts: an audio-centric expert analyzing speech content and paralinguistic cues, a visual expert extracting appearance and scene context, and a textual expert interpreting user instructions and synthesizing a coherent storyline. The collaborative reasoning resolves ambiguities and generates structured positive and negative storylines that guide the subsequent synthesis stages.
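To make the multi-turn expert dialogue concrete, here is a minimal sketch of how such a Co-Reasoning Director loop could be organized. The `Expert` interface, the number of dialogue rounds, and the way the final positive and negative storylines are synthesized are assumptions made for illustration; the report does not specify the underlying LLM APIs or prompt formats.

```python
# Minimal sketch of the Co-Reasoning Director's multi-turn expert dialogue.
# Expert.respond is a hypothetical placeholder for a modality-specific LLM call.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Expert:
    name: str                                   # e.g. "audio", "visual", "textual"
    respond: Callable[[Dict, List[str]], str]   # (inputs, dialogue so far) -> analysis

def co_reasoning_director(inputs: Dict, experts: List[Expert], rounds: int = 3) -> Dict:
    """Run a multi-turn dialogue and return positive/negative storylines."""
    dialogue: List[str] = []
    for _ in range(rounds):
        for expert in experts:
            # Each expert reads the shared dialogue and contributes its
            # modality-specific analysis (speech cues, scene context, user intent).
            dialogue.append(f"{expert.name}: {expert.respond(inputs, dialogue)}")
    # Here the textual expert synthesizes the storyline; the negative director
    # role is approximated by a second query listing behaviors to avoid.
    storyline = experts[-1].respond(inputs, dialogue + ["synthesize final storyline"])
    negative = experts[-1].respond(inputs, dialogue + ["list artifacts/behaviors to avoid"])
    return {"positive_storyline": storyline, "negative_prompts": negative}

# Toy usage with stub experts (real experts would wrap LLM calls):
stub = lambda inputs, dialogue: "analysis of " + ", ".join(inputs.keys())
experts = [Expert("audio", stub), Expert("visual", stub), Expert("textual", stub)]
result = co_reasoning_director({"audio": None, "image": None, "text": None}, experts)
```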

The spatial-temporal cascade begins with a low-resolution video diffusion model (Low-Res Video DiT) that generates a blueprint video capturing the global dynamics, content, and layout of the scene. This initial output is composed of keyframes that represent the overall motion and structure. These keyframes are then processed by a high-resolution DiT to enrich fine details while preserving identity and scene composition, guided by the Co-Reasoning Director’s global prompts. The high-resolution anchor keyframes are subsequently expanded into audio-synchronized sub-clips using a low-resolution video diffusion model conditioned on the first and last frames. This step ensures temporal coherence and lip synchronization, with the prompts augmented by the blueprint keyframes to refine motion and expression. An audio-aware interpolation strategy is applied to synthesize transition frames, enhancing spatial consistency and temporal connectivity. Finally, a high-resolution video diffusion model performs super-resolution on the low-resolution sub-clips, producing high-fidelity, temporally coherent video segments.
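The cascade described above can be summarized as a four-stage pipeline. The sketch below is schematic and rests on assumed interfaces: the model wrappers (`low_res_dit`, `keyframe_dit`, `sub_clip_dit`, `super_res_dit`), the keyframe stride, and the audio splitting are illustrative placeholders rather than the report's actual APIs, and the audio-aware transition interpolation is only noted in a comment.

```python
# Schematic of the spatio-temporal cascade under hypothetical model wrappers.
from typing import Callable, List, Sequence

def split_audio(audio: Sequence, n_segments: int) -> List[Sequence]:
    # Placeholder: divide the audio feature sequence into contiguous segments.
    step = max(1, len(audio) // max(1, n_segments))
    return [audio[i * step:(i + 1) * step] for i in range(n_segments)]

def cascade_generate(ref_image, audio, prompt, neg_prompt,
                     low_res_dit: Callable, keyframe_dit: Callable,
                     sub_clip_dit: Callable, super_res_dit: Callable,
                     keyframe_stride: int = 16):
    # 1) Blueprint video at low resolution: global dynamics, content, layout.
    blueprint = low_res_dit(ref_image, prompt, neg_prompt)

    # 2) Sample anchor keyframes and enrich detail at high resolution,
    #    preserving identity and scene composition.
    keyframes = blueprint[::keyframe_stride]
    hi_keyframes = [keyframe_dit(k, prompt) for k in keyframes]

    # 3) Expand consecutive keyframe pairs into audio-synchronized low-res
    #    sub-clips, conditioned on the first and last frames of each segment.
    audio_segments = split_audio(audio, len(hi_keyframes) - 1)
    sub_clips = [
        sub_clip_dit(first=a, last=b, audio=seg, prompt=prompt)
        for a, b, seg in zip(hi_keyframes[:-1], hi_keyframes[1:], audio_segments)
    ]

    # 4) Super-resolve each sub-clip; the full pipeline also synthesizes
    #    audio-aware transition frames between clips before this step.
    return [super_res_dit(clip) for clip in sub_clips]
```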

To support multi-character scenarios, the system incorporates a mask-prediction head attached to deep DiT features, which predicts segmentation masks to gate identity-specific audio injection into corresponding regions. This enables precise control over individual characters’ lip movements and expressions. The pipeline processes audio and visual inputs for each character through dedicated encoders, with the Human Video DiT generating intermediate representations that are refined by a Mask Prediction MLP. The resulting outputs are passed through a series of modules including DWPose, YOLO, and SAM 2 to produce a final multi-character video. This modular design ensures that each character’s motion and appearance are accurately synchronized with their respective audio input while maintaining overall scene consistency.
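A rough sketch of how mask-gated, ID-specific audio injection could look is given below, assuming simplified token shapes. The `MaskGatedAudioInjection` module, its dimensions, and the use of standard multi-head cross-attention are hypothetical stand-ins; the actual DiT block layout, the Mask Prediction MLP, and the DWPose/YOLO/SAM 2 post-processing are not reproduced here.

```python
# Sketch: predict per-ID masks from deep DiT features and gate each
# character's audio cross-attention to its own spatial region.
import torch
import torch.nn as nn

class MaskGatedAudioInjection(nn.Module):
    def __init__(self, dit_dim: int, audio_dim: int, num_ids: int):
        super().__init__()
        # Mask head attached to deep DiT features: one channel per character ID.
        self.mask_head = nn.Sequential(
            nn.Linear(dit_dim, dit_dim), nn.GELU(), nn.Linear(dit_dim, num_ids)
        )
        # ID-aware cross-attention from visual tokens to each character's audio.
        self.audio_proj = nn.Linear(audio_dim, dit_dim)
        self.attn = nn.MultiheadAttention(dit_dim, num_heads=8, batch_first=True)

    def forward(self, dit_tokens, audio_tokens_per_id):
        # dit_tokens:          (B, N, dit_dim) deep DiT block features
        # audio_tokens_per_id: list of (B, T, audio_dim) tensors, one per character
        masks = self.mask_head(dit_tokens).softmax(dim=-1)   # (B, N, num_ids)
        out = dit_tokens
        for i, audio in enumerate(audio_tokens_per_id):
            a = self.audio_proj(audio)                        # (B, T, dit_dim)
            attended, _ = self.attn(dit_tokens, a, a)         # (B, N, dit_dim)
            # Gate each character's audio signal to its predicted region.
            out = out + masks[..., i:i + 1] * attended
        return out, masks

# Example (shapes only): two speakers, each with a 50-token audio stream.
layer = MaskGatedAudioInjection(dit_dim=512, audio_dim=128, num_ids=2)
tokens = torch.randn(1, 1024, 512)
audios = [torch.randn(1, 50, 128), torch.randn(1, 50, 128)]
fused, masks = layer(tokens, audios)
```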
Experiment
- Evaluated trajectory-preserving and distribution matching distillation methods, selecting trajectory-preserving distillation for superior balance of performance, stability, and inference efficiency; enhanced with customized time schedulers and a multi-task distillation paradigm, achieving synergistic improvements in generative quality.
- Conducted human preference-based subjective evaluation on 300 test cases (100 Chinese, 100 English, 100 singing) using GSB pairwise comparisons, with (G+S)/(B+S) as the primary metric (a small illustration of this metric follows the list) and detailed assessments across face-lip synchronization, visual quality, motion quality, motion expressiveness, and text relevance.
- Outperformed three baselines—HeyGen, Kling-Avatar, and OmniHuman-1.5—on all dimensions, with significant gains in motion expressiveness and text relevance; generated more natural hair dynamics, physically plausible head poses, and accurate camera trajectories aligned with prompts.
- Achieved superior multimodal alignment, including precise lip synchronization, emotionally coherent gestures, and correct execution of fine-grained actions (e.g., folding hands in front of chest), outperforming baselines in both single-speaker and multi-person interaction scenarios.
- Introduced a shot-specific negative director with dynamic, context-aware negative prompts, enabling fine-grained control over artifacts and narrative inconsistencies, resulting in more stable, natural, and emotionally faithful video generation.
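For reference, the (G+S)/(B+S) preference metric used in the evaluation above can be computed as follows; the counts in the example are made-up placeholders, not the report's results.

```python
# GSB preference metric: G = comparisons where our result is preferred,
# S = ties ("Same"), B = comparisons where the baseline is preferred.
def gsb_score(g: int, s: int, b: int) -> float:
    """Primary metric (G + S) / (B + S); values above 1 favor our method."""
    return (g + s) / (b + s)

print(gsb_score(g=140, s=100, b=60))  # -> 1.5 on these illustrative counts
```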
Results show that KlingAvatar 2.0 outperforms all three baselines (HeyGen, Kling-Avatar, and OmniHuman-1.5) across the evaluation dimensions, achieving the highest scores in overall preference, face-lip synchronization, visual quality, motion quality, motion expressiveness, and text relevance. The gains are particularly strong in motion expressiveness and text relevance, indicating superior multimodal alignment and generative performance.
