
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, Zhao Zhong

Abstract

Recent advances in video generation have made remarkable progress in producing realistic visual content, but the absence of audio synchronized with the video significantly undermines the immersive viewing experience. To address the core challenges of video-to-audio generation, namely the scarcity of multimodal data, modality imbalance, and the limited audio quality of existing methods, we present HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that generates high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach rests on three key innovations: (1) a scalable data pipeline that curates a 100k-hour multimodal dataset through automated annotation; (2) a representation-alignment strategy that uses self-supervised audio features to anchor latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion model that resolves inter-modality competition, fusing audio and video through joint attention over two dedicated streams and injecting textual semantic context via cross-attention. Comprehensive evaluations show that HunyuanVideo-Foley achieves state-of-the-art performance in audio quality, visual-semantic alignment, temporal alignment, and distribution matching. Demo page: https://szczesnys.github.io/hunyuanvideo-foley/

One-sentence Summary

Researchers from Tencent Hunyuan, Zhejiang University, and Nanjing University of Aeronautics and Astronautics propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that generates high-fidelity, temporally aligned audio via multimodal diffusion transformers and self-supervised alignment, overcoming data scarcity and modality imbalance to enhance immersive video experiences.

Key Contributions

  • We introduce a scalable data pipeline that automatically curates a 100k-hour text-video-audio dataset, addressing multimodal scarcity and enabling robust training for video-to-audio synthesis.
  • Our Representation Alignment (REPA) loss leverages self-supervised audio features to guide latent diffusion training, improving audio fidelity and generation stability without requiring manual annotations.
  • HunyuanVideo-Foley employs a novel multimodal diffusion transformer with dual-stream fusion and cross-attention injection, resolving modality imbalance and achieving state-of-the-art alignment and quality across audio, visual, and textual semantics.

Introduction

The authors leverage recent advances in video generation to tackle the critical gap in synchronized audio, which limits immersion in synthetic media. Prior work in text-to-audio and video-to-audio generation suffers from limited multimodal data, modality imbalance favoring text over visual cues, and subpar audio fidelity that fails professional standards. HunyuanVideo-Foley introduces three key innovations: a scalable 100k-hour multimodal dataset pipeline, a representation alignment loss using self-supervised audio features to boost quality and stability, and a novel multimodal diffusion transformer that balances video-text-audio interactions via dual-stream fusion and cross-attention. The result is state-of-the-art performance in audio fidelity, temporal precision, and semantic alignment with both visual and textual inputs.

Dataset

The authors use a custom-built TV2A dataset to support multimodal audio generation, addressing the lack of high-quality, large-scale open-source data for text-video-audio tasks. Key details:

  • Dataset Composition & Sources:
    Built from raw video databases via a multi-stage filtering pipeline. Final dataset contains ~100k hours of text-video-audio material.

  • Subset Details & Filtering Rules:

    • Videos without audio streams are removed.
    • Remaining videos are segmented into 8-second chunks using scene detection.
    • Chunks with >80% silence are discarded.
    • Only audio with sampling rates >32 kHz is retained to ensure fidelity.
    • Audio quality is assessed via AudioBox-aesthetic-toolkit and SNR metrics; low-quality or noisy segments are filtered out.
    • Semantic and temporal audio-video alignment is verified using ImageBind and AV-align.
    • Segments are annotated with speech/music labels and audio categories for balanced training.
    • Audio captions are generated per segment using GenAU for descriptive grounding.
  • Usage in Model Training:
    The filtered, annotated, and captioned segments are used as training data. No explicit mixture ratios are mentioned, but category balancing is enforced via annotations.

  • Processing & Metadata:
    Cropping is done via 8-second fixed-length chunks. Metadata includes audio category tags, alignment scores, quality metrics, and generated captions—enabling structured training and evaluation.
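The filtering rules above can be sketched as a single gate applied per 8-second chunk. This is an illustrative reconstruction, not the authors' tooling: the field names, helper structure, and the `QUALITY_MIN`/`SNR_MIN`/`ALIGN_MIN` thresholds are hypothetical stand-ins (the paper reports the tools used, e.g. AudioBox-aesthetic and ImageBind, but not these exact cutoffs).

```python
# Hypothetical sketch of the multi-stage filtering pipeline. Threshold
# values and dict fields are illustrative, not from the paper.

QUALITY_MIN, SNR_MIN, ALIGN_MIN = 0.6, 10.0, 0.3  # assumed cutoffs

def filter_segment(seg):
    """Apply the paper's filtering rules to one 8-second chunk."""
    if not seg["has_audio"]:
        return False                    # drop videos without audio streams
    if seg["silence_ratio"] > 0.8:
        return False                    # drop chunks with >80% silence
    if seg["sample_rate"] <= 32_000:
        return False                    # keep only >32 kHz audio
    if seg["quality_score"] < QUALITY_MIN or seg["snr_db"] < SNR_MIN:
        return False                    # aesthetic-score / SNR quality gate
    if seg["av_align"] < ALIGN_MIN:
        return False                    # semantic/temporal AV-alignment gate
    return True

segments = [
    {"has_audio": True, "silence_ratio": 0.1, "sample_rate": 44_100,
     "quality_score": 0.8, "snr_db": 20.0, "av_align": 0.5},
    {"has_audio": True, "silence_ratio": 0.9, "sample_rate": 44_100,
     "quality_score": 0.8, "snr_db": 20.0, "av_align": 0.5},
]
kept = [s for s in segments if filter_segment(s)]
print(len(kept))  # 1 (the second chunk is mostly silence and is dropped)
```

Surviving segments would then be annotated with category labels and GenAU captions before entering training.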

Method

The authors leverage a hybrid transformer architecture, HunyuanVideo-Foley, to achieve modality-balanced, temporally coherent text-video-to-audio (TV2A) generation. The framework is structured into two distinct phases: an initial multimodal stage comprising N_1 transformer blocks that jointly process visual, textual, and audio latent representations, followed by N_2 unimodal transformer blocks dedicated exclusively to refining the audio stream. This design enables the model to first establish cross-modal alignment and then focus on high-fidelity audio synthesis.

As shown in the figure below, the input modalities are encoded independently: text is processed via a CLAP encoder, video frames through a SigLIP-2 visual encoder, and raw audio via a DAC-VAE encoder that compresses waveforms into continuous latent representations. These latents are perturbed with additive Gaussian noise to support a flow-matching diffusion objective. Synchronization features, extracted from a Synchformer visual encoder, provide frame-level temporal alignment signals that dynamically modulate the transformer blocks.

A key innovation lies in the dual-phase attention mechanism within the multimodal blocks. In self-attention, audio and visual latents are concatenated into a unified sequence after being interleaved temporally via an interleaved RoPE strategy. This ensures that adjacent audio and visual tokens receive consecutive positional embeddings, thereby enhancing the model’s ability to capture fine-grained temporal correlations. The fused sequence is then split into parallel streams, each processed through linear projections and gated by adaptive layer normalization (adaLN) layers conditioned on synchronization features and timestep embeddings. In cross-attention, the concatenated audio-visual sequence serves as the query, while CLAP-derived text embeddings provide key and value, enabling global semantic guidance without disrupting temporal structure.
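The interleaved RoPE indexing can be illustrated with a small sketch. This shows only the position-id assignment, assuming for simplicity a 1:1 pairing of audio and visual tokens per time step (the paper's actual interleaving may handle unequal token rates); the rotary embedding applied to these indices is the standard one.

```python
# Sketch of interleaved positional-index assignment: audio and visual
# tokens from the same time step receive consecutive RoPE positions,
# so cross-modal neighbors are adjacent in positional space.

def interleaved_positions(n_audio, n_video):
    """Return RoPE position ids for the audio and video token sequences.

    Assumes equal token rates for illustration, so tokens pair up 1:1.
    """
    audio_pos = [2 * t for t in range(n_audio)]       # even slots
    video_pos = [2 * t + 1 for t in range(n_video)]   # odd slots
    return audio_pos, video_pos

a, v = interleaved_positions(4, 4)
print(a)  # [0, 2, 4, 6]
print(v)  # [1, 3, 5, 7]
```

With this layout, the audio token at time step t and the visual token at the same step sit at positions 2t and 2t+1, which is what lets joint self-attention pick up fine-grained temporal correlations.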

The conditioning signal c is derived from the sum of synchronization features c_sync and timestep embeddings c_t. This composite signal is passed through parallel MLPs to generate modulation parameters α, β, and gate g, which are applied to normalize and gate intermediate features. The modulated output is integrated via residual connections, ensuring stable propagation of temporal coherence across layers.
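The modulation path can be sketched as follows. This is a minimal reconstruction under assumptions: single `Linear` layers stand in for the "parallel MLPs", a placeholder sublayer replaces the block's attention/feed-forward internals, and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

# Sketch of the adaLN modulation: the conditioning signal c = c_sync + c_t
# is mapped to (alpha, beta, g), which scale/shift the normalized features
# and gate the residual branch. Module shapes are illustrative.

class AdaLNBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # one projection per modulation parameter (stand-in for parallel MLPs)
        self.to_alpha = nn.Linear(dim, dim)
        self.to_beta = nn.Linear(dim, dim)
        self.to_gate = nn.Linear(dim, dim)
        self.ff = nn.Linear(dim, dim)  # placeholder for the block's sublayer

    def forward(self, h, c_sync, c_t):
        c = c_sync + c_t                        # composite conditioning signal
        alpha, beta = self.to_alpha(c), self.to_beta(c)
        g = torch.sigmoid(self.to_gate(c))
        x = self.norm(h) * (1 + alpha) + beta   # modulated normalization
        return h + g * self.ff(x)               # gated residual connection

blk = AdaLNBlock(16)
h = torch.randn(2, 8, 16)        # (batch, tokens, dim)
c_sync = torch.randn(2, 1, 16)   # frame-level sync features (broadcast)
c_t = torch.randn(2, 1, 16)      # timestep embedding
out = blk(h, c_sync, c_t)
print(out.shape)  # torch.Size([2, 8, 16])
```

The residual-plus-gate form keeps the block close to identity when g is small, which is what makes conditioning injection stable across many layers.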

To further enhance audio fidelity, the authors introduce the REPA loss, which aligns intermediate hidden states from the diffusion transformer with frame-level audio representations extracted by a pre-trained ATST-Frame encoder. The alignment is computed via cosine similarity between mapped latents H = MLP(h) and reference features F_r, encouraging the model to preserve semantic and acoustic structure during generation. This loss is computed at multiple layers and backpropagated to refine the audio stream before decoding via the DAC-VAE decoder.
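A minimal sketch of that alignment term, assuming the common "1 minus mean cosine similarity" form (the paper's exact loss weighting and layer selection are not reproduced here, and a single `Linear` stands in for the MLP projection):

```python
import torch
import torch.nn.functional as F

# Sketch of a REPA-style alignment loss: project intermediate hidden
# states h to H = MLP(h), then penalize low cosine similarity against
# frame-level reference features F_r from a pre-trained audio encoder
# (ATST-Frame in the paper). Shapes are illustrative.

def repa_loss(h, f_ref, mlp):
    H = mlp(h)                                    # mapped latents H = MLP(h)
    cos = F.cosine_similarity(H, f_ref, dim=-1)   # per-frame similarity
    return (1.0 - cos).mean()                     # in [0, 2]; 0 = aligned

mlp = torch.nn.Linear(32, 64)    # stand-in projection to the feature dim
h = torch.randn(2, 10, 32)       # (batch, frames, hidden)
f_ref = torch.randn(2, 10, 64)   # reference features on the same frame grid
loss = repa_loss(h, f_ref, mlp)
print(0.0 <= loss.item() <= 2.0)  # True
```

In training this term would be added to the flow-matching objective and backpropagated through the selected transformer layers.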

The training pipeline is supported by a scalable data curation system that filters video-audio pairs based on audio-visual alignment (via ImageBind and AV-align) and audio quality (via AudioBox-aesthetic and SNR metrics). Bandwidth tagging is employed to condition the model on sampling rate, appending “high-quality” tags to captions for audio above 16 kHz, which improves high-frequency retention in generated outputs. The model is trained on 100k hours of data using 128 H20 GPUs, with 18 multimodal and 36 unimodal transformer layers, each with 1536 hidden dimensions and 12 attention heads. Classifier-free guidance is applied at a 0.1 dropout rate per modality to enhance controllability.
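The per-modality classifier-free guidance dropout mentioned above can be sketched as follows. The `"<null>"` placeholder and dictionary structure are illustrative assumptions; in practice each dropped modality would be replaced by a learned null embedding rather than a string.

```python
import random

# Sketch of per-modality CFG dropout: during training, each conditioning
# modality is independently nulled out with probability 0.1, so the model
# also learns per-modality unconditional generation for guidance at
# inference time. The "<null>" marker stands in for a null embedding.

def cfg_dropout(cond, p=0.1, rng=random):
    """Independently drop each modality's conditioning with probability p."""
    return {name: ("<null>" if rng.random() < p else feat)
            for name, feat in cond.items()}

rng = random.Random(0)
cond = {"text": "glass shattering", "video": "frames[...]", "sync": "feat"}
dropped = cfg_dropout(cond, p=0.1, rng=rng)
print(sorted(dropped.keys()))  # ['sync', 'text', 'video']
```

Dropping modalities independently (rather than all at once) is what lets guidance strength be controlled separately for text, video, and sync conditions at sampling time.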

Experiment

  • HunyuanVideo-Foley sets new state-of-the-art in text-video-to-audio generation, excelling in visual-semantic alignment, audio quality, and temporal synchronization across Kling-Audio-Eval, VGGSound-Test, and MovieGen-Audio-Bench.
  • It outperforms baselines in most objective and subjective metrics, especially in IB, PQ, and DeSync, while maintaining competitive CLAP scores despite minor trade-offs in IS and CE on some datasets.
  • On VGGSound-Test, it lags slightly in distribution matching due to domain mismatch but leads in audio quality and maintains top IB performance.
  • Ablation studies confirm the superiority of its multimodal transformer design, particularly joint attention followed by cross-attention, and validate the effectiveness of interleaved RoPE and unimodal DiT.
  • Representation alignment with ATST yields optimal results; combining ATST and EAT degrades performance due to feature distribution conflicts.
  • REPA applied in unimodal DiT, especially in shallower layers, boosts alignment effectiveness.
  • DAC-VAE demonstrates robust audio reconstruction across diverse domains (speech, music, general sounds), outperforming prior methods in all evaluation metrics.
  • Spectrogram visualizations confirm precise temporal alignment and preservation of high-frequency content across dynamic scenarios.

The authors evaluate different configurations of their multimodal transformer architecture, finding that using a unimodal DiT at Layer 8 yields the best overall performance, particularly in production quality and content usefulness. While alternative layer depths maintain competitive scores, they show trade-offs in temporal alignment and audio quality. The results confirm that architectural choices significantly influence specific aspects of audio generation, with Layer 8 providing the most balanced outcomes.

The authors use HunyuanVideo-Foley to generate audio from text and video inputs, achieving strong performance across multiple evaluation metrics including audio quality, visual-semantic alignment, and temporal synchronization. While it underperforms slightly on some text-semantic alignment and distribution matching scores compared to MMAudio, it significantly improves in key areas like distribution match and visual alignment. Results show consistent advantages over baselines across diverse datasets, establishing new state-of-the-art performance in text-video-to-audio generation.

The authors evaluate different representation alignment strategies using EAT and ATST models, finding that ATST alone yields the best overall performance across audio quality, temporal alignment, and text-semantic consistency. Combining EAT and ATST degrades results, likely due to conflicting feature distributions. The optimal configuration uses ATST in unimodal DiT layers, particularly in shallower blocks, to enhance alignment without introducing noise or misalignment.

The authors use HunyuanVideo-Foley to generate audio from text and video inputs, achieving state-of-the-art performance across multiple datasets. Results show consistent improvements in visual-semantic alignment, audio quality, and temporal synchronization compared to baselines, with notable gains in distribution matching on the Kling-Audio-Eval dataset. While slightly trailing in some text-semantic metrics, the model demonstrates robust overall performance and superior reconstruction capabilities across diverse audio domains.

The authors evaluate HunyuanVideo-Foley against multiple baselines on objective and subjective metrics, showing consistent improvements in audio production quality, visual-semantic alignment, and temporal synchronization. While some baselines score higher on specific metrics like CLAP or content enjoyment, HunyuanVideo-Foley achieves the best overall subjective ratings across audio naturalness, scene matching, and timing accuracy. Results confirm its state-of-the-art performance in text-video-to-audio generation across diverse evaluation dimensions.

