5 months ago

Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, and 4 more

Abstract

Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
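
The second stage described in the abstract (sub-clips generated in parallel between consecutive blueprint keyframes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the string-based frame labels, and the `frames_per_clip` parameter are all hypothetical, and `generate_subclip` is only a placeholder standing in for a real video generation model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def keyframe_pairs(keyframes: List[str]) -> List[Tuple[str, str]]:
    """Pair each blueprint keyframe with its successor: one (first, last) pair per sub-clip."""
    return list(zip(keyframes[:-1], keyframes[1:]))


def generate_subclip(pair: Tuple[str, str], frames_per_clip: int = 4) -> List[str]:
    """Hypothetical stand-in for the second-stage video model.

    Returns placeholder frame labels that start at `first` and end at `last`.
    """
    first, last = pair
    middle = [f"{first}->{last}#{i}" for i in range(1, frames_per_clip - 1)]
    return [first, *middle, last]


def generate_long_video(keyframes: List[str], frames_per_clip: int = 4) -> List[str]:
    """Generate all sub-clips in parallel, then stitch them in timeline order."""
    pairs = keyframe_pairs(keyframes)
    with ThreadPoolExecutor() as pool:
        # Executor.map preserves input order, so clips come back in timeline order.
        clips = list(pool.map(lambda p: generate_subclip(p, frames_per_clip), pairs))
    video: List[str] = []
    for i, clip in enumerate(clips):
        # Adjacent clips share a boundary keyframe; keep it only once.
        video.extend(clip if i == 0 else clip[1:])
    return video
```

Because each sub-clip depends only on its own pair of keyframes, the clips are independent of one another, which is what makes parallel generation possible in this sketch; the shared boundary keyframes are what keep the stitched result continuous.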

Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
