Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Abstract
Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
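
The second stage described above can be pictured with a small sketch. It is illustrative only and is not the authors' implementation. It assumes the blueprint keyframes arrive as an ordered list, that each adjacent pair of keyframes bounds one sub-clip (one reading of the "first-last frame strategy"), and that neighboring sub-clips share their boundary keyframe. The names `keyframe_pairs`, `generate_sub_clip`, and `synthesize` are hypothetical, and `generate_sub_clip` is a placeholder that returns labeled strings where a real system would call a video generation model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Frame = str  # placeholder type; a real pipeline would use image tensors


def keyframe_pairs(keyframes: Sequence[Frame]) -> List[Tuple[Frame, Frame]]:
    """Pair each keyframe with its successor: (k0, k1), (k1, k2), ..."""
    if len(keyframes) < 2:
        raise ValueError("need at least two keyframes")
    return list(zip(keyframes[:-1], keyframes[1:]))


def generate_sub_clip(first: Frame, last: Frame, frames_per_clip: int) -> List[Frame]:
    """Hypothetical stand-in for the second-stage generator.

    Returns a clip that starts at `first`, ends at `last`, and fills the
    middle with labeled placeholder frames.
    """
    if frames_per_clip < 2:
        raise ValueError("a clip needs at least its first and last frame")
    middle = [f"{first}->{last}#{i}" for i in range(1, frames_per_clip - 1)]
    return [first, *middle, last]


def synthesize(keyframes: Sequence[Frame], frames_per_clip: int = 4) -> List[Frame]:
    """Generate all sub-clips in parallel, then stitch them in order."""
    pairs = keyframe_pairs(keyframes)
    # Sub-clips depend only on their own (first, last) pair, so they can
    # be produced concurrently; pool.map preserves the input order.
    with ThreadPoolExecutor() as pool:
        clips = list(
            pool.map(lambda p: generate_sub_clip(p[0], p[1], frames_per_clip), pairs)
        )
    video = list(clips[0])
    for clip in clips[1:]:
        video.extend(clip[1:])  # skip the boundary frame shared with the previous clip
    return video


if __name__ == "__main__":
    print(synthesize(["k0", "k1", "k2"]))
```

Because every sub-clip is conditioned only on its own pair of keyframes, total generation time in this sketch scales with the number of available workers, not with video length, which matches the abstract's claim that the parallel architecture suits long-duration videos.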