KlingAvatar 2.0 Technical Report

Abstract

Avatar video generation models have made remarkable progress in recent years. However, prior work remains inefficient at generating long, high-resolution videos, suffering from temporal drifting, quality degradation, and weak instruction fidelity as video duration increases. To address these challenges, we propose KlingAvatar 2.0, a hierarchical spatio-temporal framework that scales in both spatial resolution and temporal length. The framework first generates low-resolution video keyframes that capture global semantics and motion, then refines them into high-resolution, temporally coherent sub-clips through a first-and-last-frame strategy, preserving smooth temporal transitions in long videos. To improve multimodal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific experts built on large language models (LLMs). These experts assess modality priorities and infer the user's underlying intent, transforming the inputs into detailed storylines through multi-turn dialogue. An additional Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support identity-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of long, high-resolution, multimodally aligned video generation, offering improved visual clarity, realistic lip and teeth rendering with precise lip synchronization, strong identity preservation, and consistent multimodal instruction following.

One-sentence Summary

The Kling Team at Kuaishou Technology proposes KlingAvatar 2.0, a spatio-temporal cascade framework with a Co-Reasoning Director and negative prompt refinement for efficient, long-form, high-resolution audio-driven avatar video generation, enabling identity-preserving, multimodally aligned, and multi-character control with enhanced visual fidelity, lip synchronization, and instruction following.

Key Contributions

  • We propose a spatial-temporal cascade framework that generates long-duration, high-resolution avatar videos efficiently by first producing low-resolution blueprint keyframes and then progressively refining them into detailed, temporally coherent sub-clips, effectively reducing temporal drifting and visual degradation while maintaining smooth transitions.

  • We introduce a Co-Reasoning Director composed of modality-specific LLM experts that engage in multi-turn dialogue to infer user intent, resolve modality conflicts, and generate coherent shot-level storylines, while a negative director enhances instruction alignment by refining fine-grained negative prompts for improved semantic accuracy.

  • Our framework supports ID-specific multi-character control via mask-controlled audio injection that uses deep DiT block features and ID-aware attention, enabling synchronized, individualized animations in complex conversational scenarios. Extensive evaluation shows superior performance in visual quality, lip synchronization, identity preservation, and multimodal instruction following on a large-scale cinematic dataset.

Introduction

Audio-driven avatar video generation enables lifelike, expressive digital humans with synchronized facial expressions, lip movements, and body gestures, with applications in education, entertainment, and virtual assistants. While prior methods have advanced from basic lip-sync to full-body animation using diffusion models, they struggle with long-duration, high-resolution synthesis due to temporal drifting, visual degradation, and poor alignment with complex multimodal instructions. Existing approaches often fail to maintain coherence across extended sequences or handle multi-character interactions with individual audio control. The authors introduce KlingAvatar 2.0, a spatio-temporal cascade framework that first generates low-resolution blueprint keyframes for global motion and semantics, then refines them into high-resolution, temporally coherent sub-clips using a first-last frame conditioning strategy. To improve instruction adherence, they propose a Co-Reasoning Director—a multi-turn dialogue system with modality-specific LLM experts that resolve conflicts and generate detailed storylines, complemented by a negative director that enhances fine-grained prompt refinement. The framework further enables identity-specific multi-character control via mask-aware audio injection using deep DiT features. Together, these innovations enable efficient, high-fidelity, long-form video generation with strong identity preservation, accurate lip-speech synchronization, and robust multimodal alignment.

Method

The authors leverage a spatial-temporal cascade diffusion framework to enable high-fidelity, long-form digital human video generation with accurate lip synchronization and fine-grained control over multiple speakers. This framework operates through a hierarchical pipeline that integrates global planning with local refinement, as illustrated in the overall system diagram. The process begins with multimodal inputs—reference images, audio, and textual instructions—fed into a Co-Reasoning Multimodal Large Language Model (MLLM) Director. This director orchestrates a multi-turn dialogue among three specialized experts: an audio-centric expert analyzing speech content and paralinguistic cues, a visual expert extracting appearance and scene context, and a textual expert interpreting user instructions and synthesizing a coherent storyline. The collaborative reasoning resolves ambiguities and generates structured positive and negative storylines that guide the subsequent synthesis stages.
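
The report does not include reference code for this orchestration, so the following is a minimal Python sketch of a multi-turn, three-expert dialogue loop under loose assumptions. The expert functions, class names, and placeholder strings are hypothetical stand-ins for actual LLM calls and prompts, which the report does not specify.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    """Shared context accumulated across dialogue turns (hypothetical structure)."""
    inputs: dict                       # reference image, audio, text instruction
    notes: list = field(default_factory=list)

def audio_expert(state):
    # Placeholder: a real system would query an LLM about speech content,
    # emotion, rhythm, and other paralinguistic cues in the audio.
    return "audio: calm narration, moderate pace, neutral emotion"

def visual_expert(state):
    # Placeholder: appearance and scene context extracted from the reference image.
    return "visual: single speaker, indoor office scene, frontal pose"

def text_expert(state):
    # Placeholder: interprets the user instruction against the earlier notes
    # and drafts a shot-level storyline.
    return "story: speaker greets the camera, then gestures toward a chart"

def co_reasoning_director(inputs, num_turns=2):
    """Run a multi-turn dialogue among the three modality experts and return
    positive / negative storylines. Both outputs are illustrative only."""
    state = DialogueState(inputs=inputs)
    experts = [audio_expert, visual_expert, text_expert]
    for _ in range(num_turns):
        for expert in experts:
            state.notes.append(expert(state))
    positive = " | ".join(state.notes[-3:])                   # synthesized storyline
    negative = "blurred face, frozen lips, identity drift"    # refined by the Negative Director
    return positive, negative

if __name__ == "__main__":
    pos, neg = co_reasoning_director(
        {"image": "ref.png", "audio": "speech.wav", "text": "present the quarterly results"}
    )
    print("positive storyline:", pos)
    print("negative prompt:", neg)
```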

The spatial-temporal cascade begins with a low-resolution video diffusion model (Low-Res Video DiT) that generates a blueprint video capturing the global dynamics, content, and layout of the scene. This initial output is composed of keyframes that represent the overall motion and structure. These keyframes are then processed by a high-resolution DiT to enrich fine details while preserving identity and scene composition, guided by the Co-Reasoning Director’s global prompts. The high-resolution anchor keyframes are subsequently expanded into audio-synchronized sub-clips using a low-resolution video diffusion model conditioned on the first and last frames. This step ensures temporal coherence and lip synchronization, with the prompts augmented by the blueprint keyframes to refine motion and expression. An audio-aware interpolation strategy is applied to synthesize transition frames, enhancing spatial consistency and temporal connectivity. Finally, a high-resolution video diffusion model performs super-resolution on the low-resolution sub-clips, producing high-fidelity, temporally coherent video segments.
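
To make the cascade's data flow concrete, here is a shape-level Python sketch of the pipeline, assuming placeholder zero tensors in place of the actual diffusion models. All function names, resolutions, and clip lengths are hypothetical, and the audio-aware interpolation between adjacent sub-clips is omitted for brevity.

```python
import numpy as np

def low_res_blueprint(prompt, num_keyframes=8, size=(64, 64)):
    """Stand-in for the Low-Res Video DiT: produce blueprint keyframes."""
    return np.zeros((num_keyframes, *size, 3), dtype=np.float32)

def keyframe_super_resolution(keyframes, scale=4):
    """Stand-in for the high-resolution keyframe DiT (detail enrichment)."""
    t, h, w, c = keyframes.shape
    return np.zeros((t, h * scale, w * scale, c), dtype=np.float32)

def expand_subclip(first_frame, last_frame, audio_chunk, frames_per_clip=16):
    """Stand-in for first/last-frame-conditioned, audio-synchronized expansion."""
    h, w, c = first_frame.shape
    clip = np.zeros((frames_per_clip, h, w, c), dtype=np.float32)
    clip[0] = first_frame    # endpoints anchor the sub-clip,
    clip[-1] = last_frame    # keeping transitions between clips smooth
    return clip

def video_super_resolution(clip, scale=2):
    """Stand-in for the final high-resolution refinement pass."""
    t, h, w, c = clip.shape
    return np.zeros((t, h * scale, w * scale, c), dtype=np.float32)

def cascade_pipeline(prompt, audio_chunks):
    keyframes = low_res_blueprint(prompt, num_keyframes=len(audio_chunks) + 1)
    anchors = keyframe_super_resolution(keyframes)
    clips = []
    for i, audio in enumerate(audio_chunks):
        clip = expand_subclip(anchors[i], anchors[i + 1], audio)
        clips.append(video_super_resolution(clip))
    return np.concatenate(clips, axis=0)

video = cascade_pipeline("a presenter explains a chart", audio_chunks=[None] * 4)
print(video.shape)  # (64, 512, 512, 3) with the placeholder settings above
```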

To support multi-character scenarios, the system incorporates a mask-prediction head attached to deep DiT features, which predicts segmentation masks to gate identity-specific audio injection into corresponding regions. This enables precise control over individual characters’ lip movements and expressions. The pipeline processes audio and visual inputs for each character through dedicated encoders, with the Human Video DiT generating intermediate representations that are refined by a Mask Prediction MLP. The resulting outputs are passed through a series of modules including DWPose, YOLO, and SAM 2 to produce a final multi-character video. This modular design ensures that each character’s motion and appearance are accurately synchronized with their respective audio input while maintaining overall scene consistency.
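
The report describes mask-controlled audio injection gated by segmentation masks predicted from deep DiT features. The numpy sketch below illustrates only the gating step, with simple additive injection standing in for whatever attention-based fusion the model actually uses; shapes and names are assumptions.

```python
import numpy as np

def mask_gated_audio_injection(video_feat, audio_feats, masks):
    """Inject each character's audio features only inside that character's
    predicted region. Shapes (hypothetical):
      video_feat : (H, W, C) spatial feature map from a DiT block
      audio_feats: list of (C,) audio embeddings, one per character
      masks      : list of (H, W) soft masks in [0, 1], one per character
    """
    out = video_feat.copy()
    for audio, mask in zip(audio_feats, masks):
        # Broadcast each audio embedding over its masked region only,
        # so character A's speech never drives character B's lips.
        out += mask[..., None] * audio[None, None, :]
    return out

H, W, C = 32, 32, 8
video_feat = np.random.randn(H, W, C).astype(np.float32)
audio_feats = [np.random.randn(C).astype(np.float32) for _ in range(2)]
left = np.zeros((H, W), dtype=np.float32)
left[:, : W // 2] = 1.0          # toy mask: character 1 on the left half
right = np.zeros((H, W), dtype=np.float32)
right[:, W // 2 :] = 1.0         # toy mask: character 2 on the right half
fused = mask_gated_audio_injection(video_feat, audio_feats, [left, right])
print(fused.shape)  # (32, 32, 8)
```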

Experiment

  • Evaluated trajectory-preserving and distribution-matching distillation methods, selecting trajectory-preserving distillation for its superior balance of performance, stability, and inference efficiency; enhanced it with customized time schedulers and a multi-task distillation paradigm, achieving synergistic improvements in generative quality.
  • Conducted a human-preference subjective evaluation on 300 test cases (100 Chinese, 100 English, 100 singing) using GSB pairwise comparisons, with (G+S)/(B+S) as the primary metric (see the computation sketch after this list) and detailed assessments across face-lip synchronization, visual quality, motion quality, motion expressiveness, and text relevance.
  • Outperformed three baselines—HeyGen, Kling-Avatar, and OmniHuman-1.5—on all dimensions, with significant gains in motion expressiveness and text relevance; generated more natural hair dynamics, physically plausible head poses, and accurate camera trajectories aligned with prompts.
  • Achieved superior multimodal alignment, including precise lip synchronization, emotionally coherent gestures, and correct execution of fine-grained actions (e.g., folding hands in front of chest), outperforming baselines in both single-speaker and multi-person interaction scenarios.
  • Introduced a shot-specific negative director with dynamic, context-aware negative prompts, enabling fine-grained control over artifacts and narrative inconsistencies, resulting in more stable, natural, and emotionally faithful video generation.
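
As referenced in the evaluation bullet above, the primary metric is (G+S)/(B+S) over pairwise Good/Same/Bad votes. The snippet below sketches that computation; the tallies are invented for illustration and are not the paper's numbers.

```python
def gsb_score(good, same, bad):
    """Primary GSB metric from the report: (G + S) / (B + S).
    Values above 1.0 indicate the candidate is preferred over the baseline."""
    return (good + same) / (bad + same)

# Hypothetical tallies from 300 pairwise comparisons (not the paper's results).
print(round(gsb_score(good=150, same=100, bad=50), 2))  # 1.67
```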

Results show that KlingAvatar 2.0 outperforms all three baselines (HeyGen, Kling-Avatar, and OmniHuman-1.5) across the evaluation dimensions, achieving the highest scores in overall preference, face-lip synchronization, visual quality, motion quality, motion expressiveness, and text relevance. The gains are particularly strong in motion expressiveness and text relevance, indicating superior multimodal alignment and generative performance.

