Klear: Unified Multi-Task Audio-Video Joint Generation

Jun Wang Chunyu Qiang Yuxin Guo Yiran Wang Xijuan Zeng Chen Zhang Pengfei Wan

Abstract

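Audio-video joint generation has advanced rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and a scarcity of high-quality dense-caption data. To address these issues, we introduce Klear and examine three axes: model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. For training, we adopt a progressive multitask regime that combines random modality masking for joint optimization across tasks with a multistage curriculum, yielding robust representations, strengthening audio-visually aligned world knowledge, and preventing unimodal collapse. For data, we present the first large-scale audio-video dataset with dense captions and introduce a novel automated data-construction pipeline that annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Klear scales to large datasets and delivers high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings, while generalizing robustly to out-of-distribution scenarios. Across tasks, it substantially outperforms prior methods and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.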

One-sentence Summary

The authors from Kuaishou Technology propose KLEAR, a unified single-tower audio-video generation framework with Omni-Full Attention and progressive multitask training, enabling high-fidelity, temporally aligned, and instruction-following synthesis across joint and unimodal tasks, achieving performance comparable to Veo 3 while overcoming prior limitations in audio-visual synchronization and unimodal degradation through a large-scale, densely captioned dataset and scalable training strategy.

Key Contributions

  • We introduce KLEAR, a unified multi-task audio-video generation framework that achieves high-fidelity, semantically and temporally aligned outputs in both joint and unimodal settings, with performance comparable to Veo 3, addressing persistent issues like audio-visual asynchrony and lip-speech misalignment.

  • The framework features a single-tower architecture with unified DiT blocks and an Omni-Full Attention mechanism that jointly attends to audio, video, and their corresponding captions, enabling deep cross-modal fusion and strong alignment, while a progressive multitask training strategy with random modality masking prevents unimodal collapse and enhances generalization.

  • We present the first large-scale audio-video dataset with dense captions—81 million high-quality, strictly aligned triplets—generated via an automated pipeline, which enables robust training and demonstrates strong out-of-distribution generalization across benchmarks.

Introduction

The authors leverage recent advances in generative AI to address persistent challenges in audio-video joint generation, where models often suffer from audio-visual asynchrony, poor lip-speech alignment, and degradation in unimodal outputs. Prior work is limited by weak cross-modal interaction due to suboptimal architectures—such as dual-tower designs with shallow fusion—lack of diverse, high-quality training data, and single-task training regimes that induce bias and hinder generalization. To overcome these, the authors introduce KLEAR, a unified multi-task framework featuring a single-tower architecture with unified DiT blocks and an Omni-Full Attention mechanism that jointly models audio, video, and their corresponding captions for tight spatio-temporal alignment. They employ a progressive multitask training strategy with random modality masking and a performance-adaptive curriculum to enhance representation robustness and prevent unimodal collapse. Additionally, they introduce a large-scale, high-quality dataset of 81 million dense-captioned audio-video triplets, generated via an automated pipeline. KLEAR achieves state-of-the-art performance across joint and unimodal tasks, matching Veo 3 in quality while demonstrating strong out-of-distribution generalization.

Dataset

  • The dataset is composed of automatically annotated audio-visual samples, including single-speaker speech, multi-speaker speech, singing, and natural sound clips, with a final post-filtering retention rate of 27%.
  • Video filtering is based on dynamic quality (motion ratio, camera stability), static quality (sharpness, aesthetics, color saturation), content naturalness (no watermarks or excessive effects), and safety; low-resolution, low SNR/MOS, or high-silence videos (>20%) are discarded. Scene splitting ensures each sample contains only one coherent scene.
  • Audio filtering removes low SNR, poor MOS, clipped, distorted, or noisy samples, enforces less than 20% silence, and ensures high fidelity and consistent formatting. Audio-visual alignment is verified using Synchformer (temporal) and ImageBind (semantic) to ensure strong synchronization.
  • The dataset is split by audio type: vocal and non-vocal. From the vocal subset, three distinct splits are created—singing, single-speaker speech, and multi-speaker speech—each of which undergoes dense captioning.
  • Each split is annotated using specialized models: Whisper-Large-v3, SenseVoice, and Qwen2.5-Omni for speech and singing transcripts; Qwen2.5-Omni and Gemini 2.5-Pro for audio captions; and a video expert model for detailed video descriptions. Speaker attributes (gender, age) are extracted for vocal content.
  • All annotations are integrated into unified dense captions, forming a richly labeled dataset.
  • The authors use this dataset for training, combining the splits with tailored mixture ratios to balance representation across speech, singing, and sound categories, ensuring diverse and high-quality input for model training.
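The filtering criteria listed above can be illustrated with a minimal sketch. Only the 20% silence ceiling comes from the text; the `Clip` fields, the SNR, MOS, and sync thresholds, and the names `keep_clip` and `retention_rate` are hypothetical placeholders chosen for illustration, not the authors' pipeline.

```python
from dataclasses import dataclass

# The only threshold stated in the text: clips must contain less than 20% silence.
MAX_SILENCE_RATIO = 0.20


@dataclass
class Clip:
    silence_ratio: float  # fraction of the clip that is silent, in [0, 1]
    snr_db: float         # signal-to-noise ratio in dB
    mos: float            # predicted mean opinion score
    sync_score: float     # temporal audio-visual sync score (assumed scale 0..1)
    has_watermark: bool   # content-naturalness flag


def keep_clip(clip, min_snr_db=15.0, min_mos=3.0, min_sync=0.5):
    """Return True if the clip passes every filter.

    min_snr_db, min_mos, and min_sync are assumed values, not from the source.
    """
    if clip.has_watermark:
        return False
    if clip.silence_ratio >= MAX_SILENCE_RATIO:
        return False
    if clip.snr_db < min_snr_db or clip.mos < min_mos:
        return False
    return clip.sync_score >= min_sync


def retention_rate(clips):
    """Fraction of clips that survive filtering."""
    if not clips:
        return 0.0
    return sum(1 for c in clips if keep_clip(c)) / len(clips)
```

Running `retention_rate` over a raw pool is how a figure such as the reported 27% retention would be measured, although the actual pipeline applies more criteria than this sketch covers.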

Method

The authors leverage a unified single-tower architecture to enable joint audio-video generation, addressing the limitations of cascaded and dual-tower approaches. The model, named KLEAR, employs a multimodal diffusion transformer (MM-DiT) as its core backbone, which processes inputs from four modalities: video, video-related text, audio-related text, and audio. Each modality is individually encoded into latent representations using dedicated encoders—video via a 3D causal visual encoder, and text and audio via respective embedding models. These encoded sequences are then fed into the MM-DiT module, which generates latent variables for both video and audio in separate streams. The generated latents are subsequently decoded independently to produce the final audio and video outputs. Refer to the framework diagram for a visual overview of this process.

The MM-DiT module utilizes a full-attention mechanism to facilitate comprehensive cross-modal interaction. Specifically, the hidden states of video, video-related text, audio-related text, and audio are scaled, normalized, and concatenated for attention computation. The attention mechanism computes query, key, and value matrices for each modality, which are then combined to form the attention output. This is expressed as $Q = Q_V \odot Q_{VT} \odot Q_{AT} \odot Q_A$, $K = K_V \odot K_{VT} \odot K_{AT} \odot K_A$, and $V = V_V \odot V_{VT} \odot V_{AT} \odot V_A$, where the $\odot$ operator denotes concatenation. The attention output is calculated as $\text{Atn}(Q, K, V) = \text{Softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$. The resulting attention values are split back into separate modalities, undergo scaling, normalization, residual connection, and feedforward processing, and are then passed to the next MM-DiT block. This approach ensures that all modalities are unified within a joint full-attention framework, enabling effective fusion.
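The concatenate, attend, and split pattern described above can be sketched in plain Python. This is a single-head illustration under simplifying assumptions: the learned projections, scaling, normalization, residual connections, and feedforward layers are omitted, and the function names and the dictionary layout of `streams` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of token vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


def omni_full_attention(streams):
    """streams maps a modality name to its (Q, K, V) token lists.

    Tokens from all modalities are concatenated along the sequence axis,
    attended to jointly, and the result is split back per modality.
    """
    names = list(streams)
    Q = [tok for n in names for tok in streams[n][0]]
    K = [tok for n in names for tok in streams[n][1]]
    V = [tok for n in names for tok in streams[n][2]]
    joint = attention(Q, K, V)
    out, start = {}, 0
    for n in names:
        length = len(streams[n][0])
        out[n] = joint[start:start + length]
        start += length
    return out
```

Because every query attends over the concatenated keys, a video token can draw directly on audio and text tokens within the same block, which is the property that distinguishes this design from dual-tower fusion.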

To enhance positional encoding, the model incorporates Mixed Dimension Rotary Position Embedding (MixD-RoPE). For video, a 3D RoPE is applied across temporal, width, and height dimensions, capturing both absolute and relative position dependencies. For audio, compatible 1D temporal positional encodings are used, with the position IDs initialized by incrementing the maximum temporal position ID of the video modality. This design ensures a shared temporal position ID between video and audio, facilitating synchronized processing. The model is trained using a flow-matching objective, where the denoising network $\epsilon_\theta(\cdot)$ learns to predict the velocity field that transforms pure Gaussian noise to the data distribution. The training loss is defined as $\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, c, x_0, x_1} \left\| (x_1 - x_0) - \epsilon_\theta(t x_1 + (1 - t) x_0, t, c) \right\|_2^2$, with $t \sim \mathcal{U}(0, 1)$, $x_0 \sim \mathcal{N}(0, \mathbf{I})$, and $x_1 \sim p_{\text{data}}$.
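The flow-matching objective can be made concrete with a single-sample Monte Carlo estimate. The sketch below follows the loss as stated: it draws $t$ and $x_0$, forms the interpolant $t x_1 + (1 - t) x_0$, and compares the prediction against the target velocity $x_1 - x_0$. The function name, the flat-list representation of latents, and the `model(xt, t, c)` callable interface are assumptions made for illustration.

```python
import random


def flow_matching_loss(model, x1, c, rng):
    """One-sample estimate of L_FM for a single data point x1 (a list of floats).

    model(xt, t, c) is an assumed callable returning a predicted velocity
    with the same length as xt.
    """
    t = rng.random()                                       # t ~ U(0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # x0 ~ N(0, I)
    xt = [t * a + (1.0 - t) * b for a, b in zip(x1, x0)]   # t*x1 + (1-t)*x0
    target = [a - b for a, b in zip(x1, x0)]               # velocity x1 - x0
    pred = model(xt, t, c)
    return sum((u - v) ** 2 for u, v in zip(target, pred))


if __name__ == "__main__":
    data = [0.5, -1.0, 2.0]

    def zero_model(xt, t, c):
        # Trivial baseline that always predicts zero velocity.
        return [0.0] * len(xt)

    print(flow_matching_loss(zero_model, data, c=None, rng=random.Random(0)))
```

In training, this quantity would be averaged over a batch and over both the video and audio latent streams; how the two streams are weighted is not specified in the text.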

Experiment

  • KLEAR validates its effectiveness through comprehensive experiments across multiple tasks, demonstrating state-of-the-art performance in audio-video joint generation, unimodal quality, and cross-modal consistency.
  • On TI2AV, TI2V, T2V, and T2A tasks, KLEAR surpasses task-specialized baselines, achieving 34% higher unimodal quality than cascaded methods and 18% higher than joint baselines, while matching or exceeding specialized models.
  • Qualitative results show superior lip-sync accuracy, emotional expressiveness, singing/rap performance, and audio-visual synchronization, with KLEAR achieving phoneme-level alignment and natural prosody fusion, outperforming Universe-1 and Ovi.
  • Ablations confirm the single-tower architecture with omni full attention outperforms dual-tower designs, with better cross-modal alignment and robustness despite distribution mismatch in pretrained towers.
  • Multi-task masking improves cross-modal correlation and generalization, enabling strong performance on downstream tasks like I2V and I2AV.
  • The progressive training strategy significantly enhances model capabilities: post-training on high-quality data yields additional gains, while removing the schedule causes notable performance drops.
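The text does not specify how multi-task masking is implemented. One plausible reading, sketched below, samples a task per training example and applies the loss only to the modalities that task generates. The task table, the sampling weights, and the names `sample_task` and `masked_loss` are all assumptions for illustration and should not be read as the authors' method.

```python
import random

# Hypothetical task table: True means the modality is generated (loss applied),
# False means it is masked out for that task.
TASKS = {
    "T2AV": {"video": True, "audio": True},
    "T2V": {"video": True, "audio": False},
    "T2A": {"video": False, "audio": True},
}


def sample_task(rng, weights=None):
    """Pick one task name at random; weights are optional mixture ratios."""
    names = list(TASKS)
    return rng.choices(names, weights=weights, k=1)[0]


def masked_loss(per_modality_loss, task):
    """Sum per-modality losses over the modalities the sampled task generates."""
    mask = TASKS[task]
    return sum(loss for name, loss in per_modality_loss.items() if mask[name])
```

Under this reading, every example still passes through the shared single-tower backbone, so unimodal tasks regularize the same weights that the joint task uses.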

The authors use a unified single-tower architecture with omni full attention to achieve superior audio-video consistency and unimodal performance across multiple tasks. Results show that their approach outperforms both cascaded and joint baselines, with the "All Tasks (Ours)" method achieving the highest scores in video quality, audio quality, and audio-video synchronization.

The authors compare a dual-tower and a single-tower architecture for audio-video generation, with the single-tower model achieving superior performance across all metrics. Results show the single-tower approach outperforms the dual-tower variant in video quality, audio quality, and audio-video consistency, demonstrating the effectiveness of the unified architecture and omni full attention mechanism.

Results show that KLEAR achieves state-of-the-art performance across multiple audio-video generation tasks, outperforming prior methods in video quality, audio quality, and audio-visual consistency. The unified T2AV framework with omni full attention enables superior cross-modal alignment, as evidenced by higher scores in metrics such as MS, AS, ID, and IB-Score compared to cascaded and dual-tower baselines.

The authors use the provided charts to evaluate the impact of different training stages on model performance across multiple metrics. Results show that the post-train-quality stage consistently improves all evaluated metrics—video identity, audio CLAP score, TTS WER, and AV-consistency—compared to earlier stages, indicating that high-quality data and progressive training significantly enhance model performance.

