Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Abstract

Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically handled by specialized models, and the development of a truly unified framework that can seamlessly integrate all three tasks remains underexplored. Some pioneering works have attempted to unify audio understanding and generation, but they are often confined to specific domains.

To address this, we introduce Audio-Omni, the first end-to-end framework that unifies generation and editing across the general sound, music, and speech domains while also integrating multimodal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs.

Extensive experiments show that Audio-Omni achieves state-of-the-art performance across a range of benchmarks, outperforming prior unified approaches while matching or exceeding specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, pointing to a promising direction toward universal generative audio intelligence. The code, models, and dataset will be released at https://zeyuet.github.io/Audio-Omni.

One-sentence Summary

The authors propose Audio-Omni, the first end-to-end framework to unify audio understanding, generation, and editing across sound, music, and speech domains by synergizing a frozen Multimodal Large Language Model with a trainable Diffusion Transformer and utilizing the new large-scale AudioEdit dataset to achieve state-of-the-art performance and versatile zero-shot control.

Key Contributions

  • The paper introduces Audio-Omni, an end-to-end framework that unifies audio understanding, generation, and editing across the sound, music, and speech domains. This architecture combines a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer and a hybrid conditioning mechanism to separate semantic and signal features.
  • This work presents AudioEdit, a large-scale dataset consisting of over one million meticulously curated instruction-guided editing pairs designed to overcome data scarcity in audio editing.
  • Experimental results demonstrate that Audio-Omni achieves state-of-the-art performance on multiple benchmarks, matching or exceeding the capabilities of specialized expert models while exhibiting inherited abilities like zero-shot cross-lingual control and knowledge-augmented reasoning.

Introduction

Modern audio processing relies on specialized models for understanding, generation, and editing, which prevents a seamless integration of these tasks. Existing unified approaches often lack end-to-end optimization or are restricted to a single domain like speech or music, while audio editing remains particularly difficult due to a lack of large-scale, instruction-guided datasets. The authors leverage a decoupled architecture that connects a frozen Multimodal Large Language Model for reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To enable versatile performance, they introduce Audio-Omni, the first end-to-end framework to unify understanding, generation, and editing across general sound, music, and speech, supported by their new large-scale AudioEdit dataset.

Dataset

Dataset overview

The authors introduce AudioEdit, a large-scale dataset containing over 1 million samples designed for instruction-guided audio editing. The dataset is constructed through a hybrid pipeline consisting of two main branches:

  • Real Data Branch: This branch focuses on acoustic fidelity by mining authentic editing pairs from the VGGSound dataset. The authors use Gemini 2.5 Pro to identify primary sound categories and SAM-Audio for source separation. This process disentangles audio into a target track and a residual track. To ensure high quality, the authors apply a multi-stage filtering process:

    • Add, Remove, and Extract tasks: Starting from 540,000 labeled samples, the authors use Voice Activity Detection (VAD) to retain approximately 347,000 pairs, followed by CLAP-based semantic alignment to reach a final set of approximately 50,000 high-quality pairs.
    • Style Transfer tasks: The authors expand the filtered targets by using Gemini to generate semantically related keywords. After applying CLAP filtering, they obtain approximately 500,000 pairs. These are processed using ZETA to transform the audio style while preserving pitch and temporal structure, then mixed back with the residual track.
  • Synthesis Data Branch: This branch provides scale and diversity for add, remove, and extract tasks using the Scaper toolkit. The authors programmatically generate soundscapes by mixing foreground events from ESC-50 into 10-second backgrounds from AudioCaps. To increase complexity, they apply randomized parameters including onset time, SNR (0 to 3 dB), pitch shifts (-3 to +3 semitones), and time-stretch factors (0.8 to 1.2).

  • Dataset Usage: The resulting AudioEdit dataset provides a diverse mixture of tasks, including add, remove, extract, and style transfer, to support robust model training with both real-world acoustic characteristics and large-scale synthetic variety.
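The randomized parameters of the Synthesis Data Branch can be illustrated with a small sampler. This is a minimal sketch, not the authors' pipeline: it does not call Scaper, the function names are hypothetical, the 5-second event duration is an assumption based on ESC-50 clip length, the onset range is assumed so that the stretched event fits inside the 10-second background, and the RMS-based gain is a simplification of how a toolkit might realize a target SNR.

```python
import math
import random

BACKGROUND_DURATION = 10.0  # seconds, as stated for the AudioCaps backgrounds


def sample_event_params(rng, event_duration=5.0):
    """Draw one set of randomized mixing parameters.

    event_duration is an assumption (ESC-50 clips are 5 s long). The onset
    range is also assumed: the stretched event must fit in the background,
    and the stretch factor is assumed to multiply the event duration.
    """
    time_stretch = rng.uniform(0.8, 1.2)
    stretched = event_duration * time_stretch
    return {
        "onset": rng.uniform(0.0, BACKGROUND_DURATION - stretched),
        "snr_db": rng.uniform(0.0, 3.0),
        "pitch_shift": rng.uniform(-3.0, 3.0),  # semitones
        "time_stretch": time_stretch,
    }


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def gain_for_snr(foreground, background, snr_db):
    """Scale factor that puts the foreground snr_db above the background (RMS-based)."""
    return (rms(background) / rms(foreground)) * 10.0 ** (snr_db / 20.0)


rng = random.Random(0)
params = sample_event_params(rng)
```

Each call to `sample_event_params` yields one foreground-event configuration; a real pipeline would pass comparable ranges to the mixing toolkit and render the soundscape from them.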

Method

The authors leverage the Rectified Flow framework as the generative backbone for their Audio-Omni system, which models a deterministic straight-line trajectory between noise and data samples through a constant velocity field. This approach contrasts with traditional diffusion models by using an ordinary differential equation (ODE) defined as $\frac{d \mathbf{x}_t}{dt} = \mathbf{v}$, where $\mathbf{v} = \mathbf{x}_1 - \mathbf{x}_0$ represents the velocity between a data sample $\mathbf{x}_0$ and a noise sample $\mathbf{x}_1 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The solution along this path is given by $\mathbf{x}_t = (1 - t)\mathbf{x}_0 + t\mathbf{x}_1$ for $t \in [0, 1]$. A neural network $v_{\theta}(\mathbf{x}_t, t, \mathbf{c})$ is trained to predict this velocity field conditioned on the noisy state $\mathbf{x}_t$, time $t$, and conditioning signals $\mathbf{c}$. During inference, generation proceeds by solving the ODE backward from $t=1$ using predictions from $v_{\theta}$, with the final output reconstructed via a VAE decoder.
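The straight-line path and the backward ODE solve can be sketched in a few lines. This is an illustrative toy under stated assumptions: the function names are hypothetical, the solver is plain fixed-step Euler (the source does not name a solver), latents are short Python lists, and an oracle that returns the true velocity $\mathbf{x}_1 - \mathbf{x}_0$ stands in for the trained network $v_{\theta}$.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity(x0, x1):
    """Constant ground-truth velocity v = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x1, velocity_fn, num_steps=10):
    """Integrate dx/dt = v backward from t = 1 (noise) to t = 0 (data)."""
    x = list(x1)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        # Stepping backward in time subtracts dt * v from the current state.
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
x0 = [0.5, -1.0, 2.0]                      # stand-in for a data latent
x1 = [random.gauss(0.0, 1.0) for _ in x0]  # noise sample from N(0, I)


def oracle(x, t):
    """Stand-in for a trained v_theta: returns the true constant velocity."""
    return velocity(x0, x1)


recovered = euler_sample(x1, oracle, num_steps=10)
```

Because the velocity is constant along the path, Euler integration with the oracle lands back on `x0` up to floating-point error; with a learned network the result is only an approximation, and in the described system the output would still pass through the VAE decoder.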

Audio-Omni Framework

The overall framework consists of two primary components: a frozen multimodal large language model (MLLM) serving as the understanding core and a trainable DiT-based backbone for audio generation and editing. The MLLM processes textual instructions, audio waveforms, and video inputs after they are tokenized by their respective encoders. It performs two key functions: generating textual responses for understanding tasks and producing a multimodal feature representation $\mathbf{F}_{\mathrm{mm}} \in \mathbb{R}^{L_{\mathrm{mm}} \times D_{\mathrm{mm}}}$ from its penultimate layer, which serves as a conditioning signal for generative tasks. This feature is combined with a transcript-derived feature $\mathbf{F}_{\mathrm{trans}}$, obtained from a character-level encoding of the input text using a ConvNeXtV2-based Transcript Encoder, to form the High-Level Semantic Features stream $\mathbf{c}_{\mathrm{high}} = \operatorname{Concat}(\mathbf{F}_{\mathrm{mm}}, \mathbf{F}_{\mathrm{trans}})$.

For tasks requiring precise temporal alignment, such as editing and synchronization, a second conditioning stream is introduced: the Low-Level Signal Features. This stream is constructed by concatenating a mel-spectrogram feature $\mathbf{F}_{\mathrm{mel}}$, extracted from a reference audio or speech prompt using a Mel Encoder, with a synchronization feature $\mathbf{F}_{\mathrm{sync}}$, derived from the input video via a pre-trained Synchformer model, resulting in $\mathbf{c}_{\mathrm{low}} = \operatorname{Concat}(\mathbf{F}_{\mathrm{sync}}, \mathbf{F}_{\mathrm{mel}})$. These two conditioning streams are injected into the DiT backbone through distinct mechanisms. The High-Level Semantic Features are injected as context via cross-attention, enabling the model to attend to abstract instructions throughout the generation process. In contrast, the Low-Level Signal Features are fused with a time embedding through element-wise addition and then concatenated with the VAE-encoded noisy audio latent $\mathbf{x}_t$ to form the primary input to the DiT, providing strong frame-by-frame guidance.
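The two conditioning streams can be traced at the level of tensor shapes. The sketch below is a toy under several assumptions that the text does not confirm: the semantic features are concatenated along the sequence axis after sharing one feature dimension, the signal features are frame-aligned and concatenated along the channel axis, and a single time-embedding vector is broadcast to every frame. All names and sizes are hypothetical, and nested Python lists stand in for tensors.

```python
def concat_sequence(a, b):
    """Concatenate two [length x dim] feature lists along the sequence axis."""
    assert len(a[0]) == len(b[0]), "feature dims must match"
    return a + b


def concat_channels(a, b):
    """Concatenate two [frames x dim] feature lists along the channel axis."""
    assert len(a) == len(b), "frame counts must match"
    return [row_a + row_b for row_a, row_b in zip(a, b)]


def add_time_embedding(frames, t_emb):
    """Element-wise addition of one time embedding to every frame."""
    assert len(frames[0]) == len(t_emb), "embedding dim must match"
    return [[v + e for v, e in zip(row, t_emb)] for row in frames]


def filled(rows, cols, value):
    """Toy [rows x cols] feature map filled with a constant."""
    return [[value] * cols for _ in range(rows)]


# High-level semantic stream: used as cross-attention context.
f_mm = filled(6, 4, 0.1)     # L_mm x D (MLLM penultimate-layer features)
f_trans = filled(3, 4, 0.2)  # L_trans x D (transcript features)
c_high = concat_sequence(f_mm, f_trans)  # 9 x 4

# Low-level signal stream: frame-aligned guidance.
f_sync = filled(5, 2, 0.3)   # T x D_sync (video synchronization features)
f_mel = filled(5, 3, 0.4)    # T x D_mel (mel-spectrogram features)
c_low = concat_channels(f_sync, f_mel)   # 5 x 5

t_emb = [1.0] * 5            # time embedding, assumed to match c_low's width
x_t = filled(5, 8, 0.0)      # T x D_latent (noisy VAE latent)
dit_input = concat_channels(add_time_embedding(c_low, t_emb), x_t)  # 5 x 13
```

The point of the split is visible in the shapes: `c_high` keeps its own sequence length and is only attended to, while `c_low` must share the frame count of `x_t` because it is joined to the latent frame by frame.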

Model Architecture

The training objective is a unified Rectified Flow loss, which minimizes the mean squared error between the predicted velocity $v_{\theta}(\mathbf{x}_t, t, \mathbf{c})$ and the ground-truth velocity $\mathbf{v} = \mathbf{x}_1 - \mathbf{x}_0$. The loss function is defined as:

$$\mathcal{L} = \mathbb{E}_{t \sim \mathcal{U}(0, 1), \mathbf{x}_0, \mathbf{x}_1, \mathbf{c}} \left[ \left\| v_{\theta}(\mathbf{x}_t, t, \mathbf{c}) - (\mathbf{x}_1 - \mathbf{x}_0) \right\|^2 \right]$$

where $t$ is a randomly sampled timestep from a uniform distribution, $\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$ is the interpolated latent state, and $\mathbf{c}$ encompasses the full conditioning signals for the training sample. This objective enables the model to learn a single, unified representation for a wide range of audio generation and editing tasks.
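A Monte Carlo estimate of this objective over a small batch might look like the following. It is a sketch, not the authors' training code: `rf_loss` and both toy models are hypothetical, latents are plain Python lists, and the "oracle" only works because the toy conditioning deliberately leaks the clean latent (c = x0), which lets it recover the velocity as (x_t - x0) / t.

```python
import random


def rf_loss(model, batch, rng):
    """Monte Carlo estimate of E[ || v_theta(x_t, t, c) - (x1 - x0) ||^2 ]."""
    total = 0.0
    for x0, c in batch:
        x1 = [rng.gauss(0.0, 1.0) for _ in x0]  # noise sample x1 ~ N(0, I)
        t = rng.uniform(0.0, 1.0)               # timestep t ~ U(0, 1)
        x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        target = [b - a for a, b in zip(x0, x1)]
        pred = model(x_t, t, c)
        total += sum((p - g) ** 2 for p, g in zip(pred, target))
    return total / len(batch)


def oracle_model(x_t, t, c):
    """Toy model: with c = x0, the velocity is exactly (x_t - x0) / t."""
    return [(xt - a) / t for xt, a in zip(x_t, c)]


def zero_model(x_t, t, c):
    """Untrained baseline that always predicts zero velocity."""
    return [0.0 for _ in x_t]


# Each item is (x0, c); the conditioning leaks x0 only for this toy oracle.
batch = [([0.5, -1.0], [0.5, -1.0]), ([2.0, 0.0], [2.0, 0.0])]
oracle_loss = rf_loss(oracle_model, batch, random.Random(0))
zero_loss = rf_loss(zero_model, batch, random.Random(0))
```

The oracle drives the loss to roughly zero while the zero predictor is left with the full squared norm of the target velocity, which is the gap that training is meant to close.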

Experiment

Audio-Omni is evaluated through a comprehensive suite of benchmarks designed to test its understanding, generation, and editing capabilities across the full spectrum of sound, music, and speech. The experiments validate that the decoupled architecture allows the model to inherit strong reasoning and multilingual abilities from a frozen MLLM while achieving state-of-the-art performance in generative and editing tasks. Qualitative results further demonstrate emergent zero-shot capabilities, such as knowledge-augmented generation and voice conversion, proving that a single unified framework can serve as a versatile generalist for diverse audio domains.

Results show that Audio-Omni achieves superior performance across audio editing metrics compared to specialized models. The model outperforms others in fidelity and instruction adherence, demonstrating strong capabilities in editing tasks.

  • Audio-Omni outperforms specialized models on all editing metrics.
  • Audio-Omni achieves the best results in fidelity and instruction adherence.
  • The model demonstrates strong performance across multiple editing tasks.

Audio editing performance comparison

The table presents an ablation study on the impact of dataset composition for audio editing training, comparing performance across different training configurations. Results show that combining synthetic and real-world data achieves the best overall performance, with the mixed approach outperforming either data type used alone.

  • Combining synthetic and real-world data yields the best performance for audio editing.
  • Training on real-world data alone achieves better results than using synthetic data alone.
  • The mixed data approach consistently outperforms single-data configurations across all evaluation metrics.

Ablation on dataset composition

The results show a comparison of audio editing models across multiple tasks, with the proposed method achieving the best overall performance. Audio-Omni demonstrates superior results in both fidelity and quality metrics across all editing operations compared to existing models.

  • Audio-Omni outperforms all baseline models in average performance across editing tasks.
  • The proposed method achieves the lowest scores in both FAD and LSD metrics, indicating higher fidelity and quality.
  • Audio-Omni shows consistent improvement over baselines in all individual editing operations.
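To make the "lower is better" reading of LSD concrete, here is one common definition of log-spectral distance: the per-frame root-mean-square difference between log10 power spectra, averaged over frames. This is an assumption for illustration; the exact formulation, spectrogram settings, and log base used in the paper's evaluation are not given here, and the function name is hypothetical.

```python
import math


def log_spectral_distance(ref_power, est_power, eps=1e-10):
    """Mean over frames of the RMS difference between log10 power spectra.

    Both inputs are [frames x frequency_bins] lists of non-negative power
    values; eps guards against log10(0).
    """
    assert len(ref_power) == len(est_power), "frame counts must match"
    per_frame = []
    for ref_frame, est_frame in zip(ref_power, est_power):
        squared = [
            (math.log10(r + eps) - math.log10(e + eps)) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        per_frame.append(math.sqrt(sum(squared) / len(squared)))
    return sum(per_frame) / len(per_frame)


reference = [[1.0, 2.0, 4.0], [0.5, 1.0, 2.0]]
identical = [row[:] for row in reference]
ten_times_louder = [[10.0 * p for p in row] for row in reference]

lsd_same = log_spectral_distance(reference, identical)
lsd_scaled = log_spectral_distance(reference, ten_times_louder)
```

Identical spectra give a distance of zero, and any deviation from the reference increases it, which is why the lowest LSD indicates the closest match to the target audio.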

Audio editing performance comparison

Results show that Audio-Omni achieves strong performance across multiple audio tasks, consistently outperforming other unified models and matching or exceeding specialized models in various benchmarks. The framework demonstrates superior results in both understanding and generation tasks, highlighting its effectiveness as a comprehensive audio system.

  • Audio-Omni surpasses other unified models in understanding and generation tasks.
  • The model achieves competitive results compared to specialized expert models.
  • It demonstrates strong performance across diverse audio domains including speech, music, and sound.

Audio-Omni outperforms prior models

The study compares different feature sources from a frozen MLLM for audio generation tasks, evaluating their impact on text-to-audio and text-to-music performance. Results show that using features from the penultimate layer consistently outperforms other methods across both tasks.

  • Using features from the penultimate layer of the MLLM achieves the best performance for both text-to-audio and text-to-music tasks.
  • The last layer features perform worse than the penultimate layer, indicating over-specialization for text prediction.
  • Complex query mechanisms like MetaQuery and Query degrade performance compared to direct feature extraction from the penultimate layer.

Ablation on feature source selection

Evaluations demonstrate that Audio-Omni achieves superior fidelity and instruction adherence across diverse audio editing and general tasks, often outperforming both specialized models and existing unified frameworks. Ablation studies reveal that training with a combination of synthetic and real-world data yields the best results, while extracting features from the penultimate layer of a frozen MLLM provides optimal performance for generation tasks. Collectively, these findings highlight the effectiveness of the proposed model and its robust capabilities across speech, music, and sound domains.

