8 months ago

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

Abstract

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
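
The abstract names a flow matching objective without spelling it out. As a rough illustration only, the sketch below shows one common generic form of that objective: a straight path from noise to data, with the model regressing the path's velocity. It is not taken from the MMAudio code. The function name `flow_matching_loss`, the `velocity_fn` callable, the `cond` argument, and the choice of time convention (t = 0 is noise, t = 1 is data) are all assumptions made for this example.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Generic flow matching loss on a linear noise-to-data path.

    velocity_fn(xt, t, cond) is a hypothetical stand-in for a network that
    predicts a velocity; x1 is a batch of data latents with shape (batch, dim).
    """
    x0 = rng.standard_normal(x1.shape)        # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))    # one time in [0, 1) per example
    xt = (1.0 - t) * x0 + t * x1              # point on the straight path
    target = x1 - x0                          # velocity of that path (constant in t)
    pred = velocity_fn(xt, t, cond)
    return float(np.mean((pred - target) ** 2))

# Toy usage: random vectors stand in for audio latents, and a predictor that
# always outputs zero stands in for an untrained network (both assumptions).
rng = np.random.default_rng(0)
x1 = rng.standard_normal((16, 8))
loss = flow_matching_loss(lambda xt, t, cond: np.zeros_like(xt), x1, None, rng)
print(loss)
```

In a real system the conditions (video and text features in this paper's setting) would be passed through `cond`, and sampling would integrate the learned velocity from noise to data over a number of steps.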


Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis