
Fast Text-to-Audio Generation with Adversarial Post-Training

Abstract

Although text-to-audio systems have improved steadily in quality, their inference latency remains high, failing to meet the practical latency requirements of many creative applications. This work proposes Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models that does not rely on distillation. While prior adversarial post-training methods have underperformed their expensive distillation counterparts, ARC is a simple procedure characterized by two components: (1) it extends a recently proposed relativistic adversarial framework to diffusion/flow post-training, and (2) it combines this with a novel contrastive discriminator objective to improve prompt adherence. Combining ARC with several optimizations to Stable Audio Open, the authors build a model that generates roughly 12 seconds of 44.1kHz stereo audio in about 75ms on an H100 GPU, and in about 7 seconds on a mobile edge device. To the authors' knowledge, this is the fastest text-to-audio model to date.

One-sentence Summary

The authors, from UC San Diego, Stability AI, and Arm, propose ARC (Adversarial Relativistic-Contrastive) post-training—a novel distillation-free acceleration method for text-to-audio diffusion/flow models that combines a relativistic adversarial loss with a contrastive discriminator to enhance prompt adherence and realism, enabling 12-second 44.1kHz stereo audio generation in 75ms on an H100 and 7 seconds on mobile edge devices, significantly advancing real-time creative applications.

Key Contributions

  • Text-to-audio generation remains slow due to the iterative sampling nature of diffusion and flow models, limiting real-time creative applications despite recent advances in model quality.
  • The authors introduce Adversarial Relativistic-Contrastive (ARC) post-training, a novel distillation-free method that combines a relativistic adversarial loss with a contrastive discriminator objective to improve audio realism and prompt adherence during accelerated inference.
  • ARC enables generation of approximately 12 seconds of 44.1kHz stereo audio in just 75ms on an H100 GPU and under 7 seconds on mobile edge devices, outperforming prior methods in speed and diversity while being the first fully adversarial, non-distillation approach for audio flow models.

Introduction

Text-to-audio generation has made significant strides in quality, but inference latency remains a major bottleneck—current models often require seconds to minutes per generation, limiting real-time creative applications. Prior acceleration methods rely heavily on distillation, which demands substantial computational resources for training and storage, and often inherits the drawbacks of Classifier-Free Guidance, such as reduced diversity and over-saturation. Some post-training approaches avoid distillation by using adversarial losses, but these have seen limited success in audio due to weak prompt adherence and lack of effective training recipes. The authors introduce Adversarial Relativistic-Contrastive (ARC) post-training, a novel framework that extends relativistic adversarial training to text-conditioned audio and introduces a contrastive discriminator objective to enforce prompt fidelity. This approach enables fast, high-quality audio generation without distillation or CFG, achieving 12 seconds of 44.1kHz stereo audio in just 75ms on an H100 GPU—100x faster than the original model—and enabling on-device inference in ~7 seconds on mobile CPUs, making it the fastest text-to-audio system to date.

Method

The authors leverage a two-stage framework for text-to-audio generation, beginning with a pre-trained rectified flow model and proceeding to adversarial post-training to accelerate sampling while preserving quality and prompt adherence. The overall architecture is built upon a latent diffusion model that operates in a compressed audio space. The base model consists of a 156M-parameter autoencoder from SAO, which compresses stereo audio waveforms into a 64-channel latent representation at a 21.5Hz frame rate. This latent space is conditioned on text prompts encoded by a 109M-parameter T5 text embedder. The core generative component is a Diffusion Transformer (DiT), initially pre-trained as a rectified flow model to predict the flow velocity $v_{\theta}(\mathbf{x}_t, t, \mathbf{c})$, which is used to reverse the forward noising process defined by $\mathbf{x}_t = (1 - t)\mathbf{x}_0 + t\boldsymbol{\epsilon}$.
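For concreteness, the forward noising process and the rectified-flow pre-training objective above can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the authors' code; the `(batch, channels, frames)` latent layout and the function names are assumptions.

```python
import torch

def noise_sample(x0: torch.Tensor, t: torch.Tensor):
    """Forward noising process: x_t = (1 - t) * x_0 + t * eps."""
    eps = torch.randn_like(x0)
    t = t.view(-1, 1, 1)                  # broadcast over (batch, channels, frames)
    return (1.0 - t) * x0 + t * eps, eps

def rf_pretraining_loss(v_theta, x0, t, c):
    """Rectified-flow pre-training: regress the velocity d(x_t)/dt = eps - x_0."""
    xt, eps = noise_sample(x0, t)
    target = eps - x0                     # constant velocity along the straight path
    return torch.mean((v_theta(xt, t, c) - target) ** 2)
```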

The post-training stage transforms the pre-trained velocity predictor $v_{\theta}$ into a few-step generator $G_{\phi}$ by reparameterizing the model to directly output clean audio $\hat{\mathbf{x}}_0$ from a noisy input $\mathbf{x}_t$. This is achieved through adversarial post-training, where the generator $G_{\phi}$ and a discriminator $D_{\psi}$ are jointly optimized. The discriminator is initialized from the pre-trained DiT, using its input embedding layers and 75% of its blocks, and is augmented with a lightweight 1D convolutional head. During training, real audio $\mathbf{x}_0$ is corrupted with noise to obtain $\mathbf{x}_t$, which is passed to the generator to produce a denoised sample $\hat{\mathbf{x}}_0$. Both the real and generated samples are then re-noised to a lower noise level $s$ to form the inputs for the discriminator. The generator is trained to minimize a relativistic adversarial loss $\mathcal{L}_{\mathrm{R}}$, which compares the discriminator's output on a generated sample against its output on a paired real sample, encouraging the generator to produce outputs that score as "more real" than their real counterparts. This loss is defined as $\mathcal{L}_{\mathrm{R}}(\phi, \psi) = \mathbb{E}[f(\Delta_{\mathrm{gen}} - \Delta_{\mathrm{real}})]$, where $f(x) = -\log(1 + e^{-x})$, $\Delta_{\mathrm{gen}}$ is the discriminator logit for the generated sample, and $\Delta_{\mathrm{real}}$ is the logit for the real sample.
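A minimal sketch of this relativistic loss follows, assuming a PyTorch discriminator `D_psi` that returns one logit per sample, read here as a "fakeness" score so that the generator minimizes the loss while the discriminator maximizes it; the `renoise` helper and argument layout are assumptions. Note that $f(x) = -\log(1 + e^{-x}) = \log \sigma(x)$, i.e. a log-sigmoid.

```python
import torch
import torch.nn.functional as F

def renoise(x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Re-noise a clean (real or generated) sample to noise level s."""
    s = s.view(-1, 1, 1)
    return (1.0 - s) * x + s * torch.randn_like(x)

def relativistic_loss(D_psi, x_gen, x_real, s, c):
    """L_R(phi, psi) = E[f(Δ_gen - Δ_real)], with f(x) = log sigmoid(x)."""
    d_gen = D_psi(renoise(x_gen, s), s, c)    # Δ_gen: logit for the generated sample
    d_real = D_psi(renoise(x_real, s), s, c)  # Δ_real: logit for the paired real sample
    return F.logsigmoid(d_gen - d_real).mean()
```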

To address the issue of poor prompt adherence that can arise from the realism-focused adversarial loss, the authors introduce a contrastive loss $\mathcal{L}_{\mathrm{C}}$ for the discriminator. This loss is designed to improve the discriminator's ability to understand the alignment between audio and text prompts. It is applied by shuffling the text prompts within a batch, creating mismatched audio-text pairs, and training the discriminator to maximize the difference between the logits for correct pairs and incorrect pairs. The loss is formulated as $\mathcal{L}_{\mathrm{C}}(\psi) = \mathbb{E}[f(\Delta_{\mathrm{real}}(\mathbf{x}_0, s, \mathcal{P}[\mathbf{c}]) - \Delta_{\mathrm{real}}(\mathbf{x}_0, s, \mathbf{c}))]$, where $\mathcal{P}[\cdot]$ denotes a random permutation of the prompts. This contrastive objective encourages the discriminator to focus on semantic features rather than spurious correlations, thereby providing a stronger gradient signal for the generator to improve prompt adherence. The total post-training objective is the sum of the relativistic adversarial loss and the contrastive loss: $\mathcal{L}_{\mathrm{ARC}}(\phi, \psi) = \mathcal{L}_{\mathrm{R}}(\phi, \psi) + \lambda \cdot \mathcal{L}_{\mathrm{C}}(\psi)$.
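The contrastive term admits an equally small sketch: shuffle the prompts within the batch and apply the same $f$ to the logit gap between mismatched and matched pairs. Again an illustrative sketch under the same conventions; the prompt-embedding tensor `c` with batch as its first dimension is an assumption.

```python
import torch
import torch.nn.functional as F

def renoise(x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Re-noise a sample to level s (same helper as in the previous sketch)."""
    s = s.view(-1, 1, 1)
    return (1.0 - s) * x + s * torch.randn_like(x)

def contrastive_loss(D_psi, x0, s, c):
    """L_C(psi) = E[f(Δ_real(x_0, s, P[c]) - Δ_real(x_0, s, c))]."""
    perm = torch.randperm(c.shape[0], device=c.device)  # P[.]: shuffle prompts in-batch
    xs = renoise(x0, s)                                 # score the same re-noised audio twice
    d_matched = D_psi(xs, s, c)                         # logit with the correct prompt
    d_mismatched = D_psi(xs, s, c[perm])                # logit with a mismatched prompt
    return F.logsigmoid(d_mismatched - d_matched).mean()

# Full objective: L_ARC(phi, psi) = L_R(phi, psi) + lambda * L_C(psi).
```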

After post-training, the model is used for inference with a specialized sampling strategy. The generator $G_{\phi}$ is designed to directly estimate clean outputs from noisy inputs, which necessitates a departure from the traditional ODE solvers used by rectified flows. Instead, the authors employ ping-pong sampling, which alternates between denoising a sample using $G_{\phi}$ and re-noising it to a lower noise level. This iterative refinement process allows the model to produce high-fidelity audio in a small number of steps, significantly accelerating the generation process.
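A sketch of ping-pong sampling under these definitions is below. The linearly decreasing noise schedule is an illustrative assumption, not necessarily the authors' schedule; the default of 8 steps follows the experiments reported later.

```python
import torch

@torch.no_grad()
def ping_pong_sample(G_phi, c, shape, steps=8, device="cuda"):
    """Alternate denoising with G_phi and re-noising to a lower noise level."""
    x = torch.randn(shape, device=device)             # start from pure noise (t = 1)
    levels = torch.linspace(1.0, 0.0, steps + 1)      # decreasing noise levels
    for i in range(steps):
        t = torch.full((shape[0],), float(levels[i]), device=device)
        x0_hat = G_phi(x, t, c)                       # denoise: one-shot clean estimate
        s = float(levels[i + 1])
        if s > 0:
            x = (1.0 - s) * x0_hat + s * torch.randn_like(x0_hat)  # re-noise to level s
        else:
            x = x0_hat                                # final step keeps the clean estimate
    return x
```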

Experiment

  • Evaluated accelerated text-to-audio models on the AudioCaps test set (4,875 generations), validating performance across audio quality, semantic alignment, prompt adherence, and diversity using FD_openl3, KL_passt, CLAP score, R_passt, C_passt, and the proposed CCDS metric.
  • Achieved 100x speedup over SAO (100-step) and 10x over pre-trained RF (50-step) with competitive metrics, while ARC post-training improved diversity without significant quality loss.
  • ARC outperformed Presto in diversity (higher CCDS and MOS diversity) despite slightly lower quality, with Presto showing high quality but severe diversity reduction.
  • 8-step inference achieved the best results, aligning with recent findings on step efficiency in accelerated models; ablations confirmed the importance of the relativistic loss and of joint $\mathcal{L}_{\mathrm{R}}$ and $\mathcal{L}_{\mathrm{C}}$ training.
  • CCDS metric showed strong correlation with subjective diversity MOS, validating its effectiveness for conditional diversity assessment.
  • On edge devices (a Vivo X200 Pro smartphone), dynamic Int8 quantization reduced inference time from 15.3s to 6.6s and peak memory from 6.5GB to 3.6GB, and also enables sub-200ms latency on consumer GPUs (a generic quantization sketch follows this list).
  • Demonstrated practical creative applications including real-time sound design, style transfer, voice-to-audio control, and beat-aligned generation, highlighting responsiveness and versatility.
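The dynamic Int8 quantization referenced in the edge-device bullet can be illustrated with a generic PyTorch recipe. This is a sketch of the technique in general, not the authors' on-device export pipeline, which is not described here; `dit_generator` is a placeholder for the float32 few-step generator.

```python
import torch
from torch.ao.quantization import quantize_dynamic

# Dynamic int8 quantization: weights are stored in int8 and activations are
# scaled on the fly, shrinking peak memory and speeding up CPU inference.
dit_int8 = quantize_dynamic(
    dit_generator,          # placeholder for the float32 generator network
    {torch.nn.Linear},      # quantize the transformer's linear layers
    dtype=torch.qint8,
)
```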

The authors use a rectified flow model trained with ARC post-training to achieve significant speed improvements while maintaining high audio quality and prompt adherence. Results show that their method achieves competitive performance with state-of-the-art models like SAO and Presto, but with substantially lower latency and higher diversity, particularly when using 8 steps, while also demonstrating effective edge-device optimization through quantization.

