Technical Report on KlingAvatar 2.0
Abstract
Avatar video generation models have made remarkable progress in recent years. However, prior models are limited in their efficiency at producing long, high-resolution videos, suffering from temporal drifting, quality degradation, and weak instruction adherence as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that scales up spatial resolution and temporal extent simultaneously. The framework first generates low-resolution keyframes of a blueprint video that capture the global semantics and motion, then refines them into high-resolution, temporally coherent sub-clips using a first-and-last-frame strategy, preserving smooth temporal transitions across long videos. To strengthen the fusion and alignment of cross-modal instructions in long videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts assess modality priorities and infer the user's latent intent, converting the inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support identity-specific control of more than one character. Extensive experiments show that our model effectively addresses the challenges of efficiently generating long, high-resolution, multimodally aligned videos, delivering enhanced visual clarity, realistic lip and teeth rendering with accurate lip-motion synchronization, strong identity preservation, and coherent multimodal instruction following.
One-sentence Summary
The Kling Team at Kuaishou Technology proposes KlingAvatar 2.0, a spatio-temporal cascade framework with a Co-Reasoning Director and negative prompt refinement for efficient, long-form, high-resolution audio-driven avatar video generation, enabling identity-preserving, multimodally aligned, and multi-character control with enhanced visual fidelity, lip synchronization, and instruction following.
Key Contributions
- We propose a spatial-temporal cascade framework that generates long-duration, high-resolution avatar videos efficiently by first producing low-resolution blueprint keyframes and then progressively refining them into detailed, temporally coherent sub-clips, effectively reducing temporal drifting and visual degradation while maintaining smooth transitions.
- We introduce a Co-Reasoning Director composed of modality-specific LLM experts that engage in multi-turn dialogue to infer user intent, resolve modality conflicts, and generate coherent shot-level storylines, while a negative director enhances instruction alignment by refining fine-grained negative prompts for improved semantic accuracy.
- Our framework supports ID-specific multi-character control via mask-controlled audio injection using deep DiT block features and ID-aware attention, enabling synchronized, individualized animations in complex conversational scenarios, with extensive evaluation showing superior performance in visual quality, lip synchronization, identity preservation, and multimodal instruction following on a large-scale cinematic dataset.
Introduction
Audio-driven avatar video generation enables lifelike, expressive digital humans with synchronized facial expressions, lip movements, and body gestures, with applications in education, entertainment, and virtual assistants. While prior methods have advanced from basic lip-sync to full-body animation using diffusion models, they struggle with long-duration, high-resolution synthesis due to temporal drifting, visual degradation, and poor alignment with complex multimodal instructions. Existing approaches often fail to maintain coherence across extended sequences or handle multi-character interactions with individual audio control. The authors introduce KlingAvatar 2.0, a spatio-temporal cascade framework that first generates low-resolution blueprint keyframes for global motion and semantics, then refines them into high-resolution, temporally coherent sub-clips using a first-last frame conditioning strategy. To improve instruction adherence, they propose a Co-Reasoning Director—a multi-turn dialogue system with modality-specific LLM experts that resolve conflicts and generate detailed storylines, complemented by a negative director that enhances fine-grained prompt refinement. The framework further enables identity-specific multi-character control via mask-aware audio injection using deep DiT features. Together, these innovations enable efficient, high-fidelity, long-form video generation with strong identity preservation, accurate lip-speech synchronization, and robust multimodal alignment.
Method
The authors leverage a spatial-temporal cascade diffusion framework to enable high-fidelity, long-form digital human video generation with accurate lip synchronization and fine-grained control over multiple speakers. This framework operates through a hierarchical pipeline that integrates global planning with local refinement, as illustrated in the overall system diagram. The process begins with multimodal inputs—reference images, audio, and textual instructions—fed into a Co-Reasoning Multimodal Large Language Model (MLLM) Director. This director orchestrates a multi-turn dialogue among three specialized experts: an audio-centric expert analyzing speech content and paralinguistic cues, a visual expert extracting appearance and scene context, and a textual expert interpreting user instructions and synthesizing a coherent storyline. The collaborative reasoning resolves ambiguities and generates structured positive and negative storylines that guide the subsequent synthesis stages.
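
To make the director's workflow concrete, the sketch below shows one way such a multi-turn expert dialogue could be orchestrated. The `chat` interface, the expert prompts, and the number of rounds are illustrative assumptions, not the authors' implementation; only the three-expert, multi-turn structure follows the description above.

```python
# Minimal sketch of a Co-Reasoning Director loop (illustrative; the chat()
# interface, expert prompts, and round count are assumptions, not the paper's code).

def chat(system_prompt: str, history: list[dict]) -> str:
    """Placeholder for a call to an LLM backend (e.g., an API client)."""
    return f"[reply to: {system_prompt[:40]}...]"

EXPERTS = {
    "audio": "Analyze the speech content and paralinguistic cues (emotion, rhythm, emphasis).",
    "visual": "Describe the reference image: appearance, identity cues, and scene context.",
    "text": "Interpret the user's instructions and merge all expert findings into a storyline.",
}

def co_reasoning_director(user_inputs: dict, rounds: int = 3) -> dict:
    """Run a multi-turn dialogue among modality experts and return
    structured positive/negative storylines for the later synthesis stages."""
    history = [{"role": "user", "content": str(user_inputs)}]
    for _ in range(rounds):
        for name, system_prompt in EXPERTS.items():
            reply = chat(system_prompt, history)
            history.append({"role": "assistant", "name": name, "content": reply})
    # The textual expert produces the final shot-level storylines.
    final = chat(EXPERTS["text"] + " Output JSON with 'positive' and 'negative' storylines.",
                 history)
    return {"storylines": final, "dialogue": history}
```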

The spatial-temporal cascade begins with a low-resolution video diffusion model (Low-Res Video DiT) that generates a blueprint video capturing the global dynamics, content, and layout of the scene. This initial output is composed of keyframes that represent the overall motion and structure. These keyframes are then processed by a high-resolution DiT to enrich fine details while preserving identity and scene composition, guided by the Co-Reasoning Director’s global prompts. The high-resolution anchor keyframes are subsequently expanded into audio-synchronized sub-clips using a low-resolution video diffusion model conditioned on the first and last frames. This step ensures temporal coherence and lip synchronization, with the prompts augmented by the blueprint keyframes to refine motion and expression. An audio-aware interpolation strategy is applied to synthesize transition frames, enhancing spatial consistency and temporal connectivity. Finally, a high-resolution video diffusion model performs super-resolution on the low-resolution sub-clips, producing high-fidelity, temporally coherent video segments.
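
The staging of this cascade can be summarized in pseudocode. In the sketch below, the four model stages are passed in as callables, and interfaces such as `sample_keyframes` and `audio.split` are assumptions introduced for illustration; only the ordering of the stages follows the description above.

```python
def generate_long_video(ref_image, audio, storylines,
                        lowres_dit, keyframe_sr_dit, subclip_dit, video_sr_dit):
    """Staged inference following the cascade described above; the model
    objects are passed in as callables with assumed interfaces."""
    # 1) Low-resolution blueprint video capturing global dynamics, content, and layout.
    blueprint = lowres_dit(ref_image, audio, prompt=storylines["positive"],
                           negative=storylines["negative"])

    # 2) Select anchor keyframes and enrich them with a high-resolution DiT.
    keyframes = blueprint.sample_keyframes()
    hi_keyframes = [keyframe_sr_dit(k, prompt=storylines["positive"]) for k in keyframes]

    # 3) Expand each pair of consecutive anchors into an audio-synchronized
    #    low-resolution sub-clip, conditioned on its first and last frames.
    audio_segments = audio.split(len(hi_keyframes) - 1)
    sub_clips = [
        subclip_dit(first_frame=first, last_frame=last, audio=seg,
                    prompt=storylines["positive"])
        for first, last, seg in zip(hi_keyframes[:-1], hi_keyframes[1:], audio_segments)
    ]

    # 4) Super-resolve each sub-clip; the caller stitches the clips together.
    return [video_sr_dit(clip) for clip in sub_clips]
```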

To support multi-character scenarios, the system incorporates a mask-prediction head attached to deep DiT features, which predicts segmentation masks to gate identity-specific audio injection into corresponding regions. This enables precise control over individual characters’ lip movements and expressions. The pipeline processes audio and visual inputs for each character through dedicated encoders, with the Human Video DiT generating intermediate representations that are refined by a Mask Prediction MLP. The resulting outputs are passed through a series of modules including DWPose, YOLO, and SAM 2 to produce a final multi-character video. This modular design ensures that each character’s motion and appearance are accurately synchronized with their respective audio input while maintaining overall scene consistency.
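
As a rough illustration of how identity-specific audio could be gated by predicted masks inside a DiT block, the PyTorch sketch below implements mask-gated cross-attention over per-identity audio features. The tensor shapes, the MLP mask head, and the soft-mask gating are assumptions based on the description, not the released architecture.

```python
import torch
import torch.nn.functional as F

class MaskGatedAudioInjection(torch.nn.Module):
    """Illustrative mask-gated, per-identity audio cross-attention
    (shapes and head design are assumptions, not the paper's layers)."""

    def __init__(self, dim: int, audio_dim: int, num_ids: int):
        super().__init__()
        # Predicts a soft mask per identity from deep DiT video tokens.
        self.mask_head = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.GELU(), torch.nn.Linear(dim, num_ids))
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(audio_dim, dim)
        self.to_v = torch.nn.Linear(audio_dim, dim)

    def forward(self, x: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) video tokens; audio_feats: (B, num_ids, T, audio_dim)
        masks = self.mask_head(x).softmax(dim=-1)            # (B, N, num_ids)
        q = self.to_q(x)
        out = torch.zeros_like(x)
        for i in range(audio_feats.shape[1]):
            k = self.to_k(audio_feats[:, i])                 # (B, T, dim)
            v = self.to_v(audio_feats[:, i])
            attn = F.scaled_dot_product_attention(q, k, v)   # ID-aware cross-attention
            out = out + masks[..., i : i + 1] * attn         # gate by the predicted mask
        return x + out
```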
Experiment
- Evaluated trajectory-preserving and distribution matching distillation methods, selecting trajectory-preserving distillation for superior balance of performance, stability, and inference efficiency; enhanced with customized time schedulers and a multi-task distillation paradigm, achieving synergistic improvements in generative quality.
- Conducted human preference-based subjective evaluation on 300 test cases (100 Chinese, 100 English, 100 singing) using GSB pairwise comparisons, with (G+S)/(B+S) as the primary metric (see the sketch after this list) and detailed assessments across face-lip synchronization, visual quality, motion quality, motion expressiveness, and text relevance.
- Outperformed three baselines—HeyGen, Kling-Avatar, and OmniHuman-1.5—on all dimensions, with significant gains in motion expressiveness and text relevance; generated more natural hair dynamics, physically plausible head poses, and accurate camera trajectories aligned with prompts.
- Achieved superior multimodal alignment, including precise lip synchronization, emotionally coherent gestures, and correct execution of fine-grained actions (e.g., folding hands in front of chest), outperforming baselines in both single-speaker and multi-person interaction scenarios.
- Introduced a shot-specific negative director with dynamic, context-aware negative prompts, enabling fine-grained control over artifacts and narrative inconsistencies, resulting in more stable, natural, and emotionally faithful video generation.
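
For reference, the GSB score mentioned above counts, for each pairwise comparison, whether annotators rate our result Good (better), Same (tie), or Bad (worse) against the baseline, and reports (G+S)/(B+S); values above 1.0 favor our model. The helper below is a minimal sketch of that computation, with an assumed vote encoding and made-up example counts.

```python
from collections import Counter

def gsb_score(votes: list[str]) -> float:
    """votes: one label per comparison, from {"G", "S", "B"}."""
    c = Counter(votes)
    g, s, b = c["G"], c["S"], c["B"]
    return (g + s) / (b + s)

# Hypothetical example: 140 wins, 100 ties, 60 losses -> (140+100)/(60+100) = 1.5
print(gsb_score(["G"] * 140 + ["S"] * 100 + ["B"] * 60))
```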
Results show that KlingAvatar 2.0 outperforms HeyGen, Kling-Avatar, and OmniHuman-1.5 across all evaluation dimensions, achieving the highest scores in overall preference, face-lip synchronization, visual quality, motion quality, motion expressiveness, and text relevance. The gains are particularly pronounced in motion expressiveness and text relevance, indicating superior multimodal alignment and generative performance.
