Fast Text-to-Audio Generation with Adversarial Post-Training

Abstract

Text-to-audio systems, while increasingly capable, are slow at inference time, making their latency impractical for many creative applications. We introduce Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration method for diffusion/flow models that does not rely on distillation. While prior adversarial post-training methods have struggled to compete with their expensive distillation-based counterparts, ARC is a simple procedure that (1) extends a recent relativistic adversarial formulation to the post-training setting for diffusion/flow models, and (2) combines it with a novel contrastive discriminator objective to improve prompt adherence. Combining ARC with several optimizations applied to Stable Audio Open, we build a model capable of generating roughly 12 seconds of 44.1 kHz stereo audio in about 75 ms on an H100, and in about 7 seconds on a mobile edge device, making it, to our knowledge, the fastest text-to-audio model to date.

One-sentence Summary

The authors, from UC San Diego, Stability AI, and Arm, propose ARC (Adversarial Relativistic-Contrastive) post-training, a novel distillation-free acceleration method for text-to-audio diffusion/flow models that combines a relativistic adversarial loss with a contrastive discriminator objective to enhance prompt adherence and realism, enabling generation of 12 seconds of 44.1 kHz stereo audio in 75 ms on an H100 and in about 7 seconds on mobile edge devices, significantly advancing real-time creative applications.

Key Contributions

  • Text-to-audio generation remains slow due to the iterative sampling nature of diffusion and flow models, limiting real-time creative applications despite recent advances in model quality.
  • The authors introduce Adversarial Relativistic-Contrastive (ARC) post-training, a novel distillation-free method that combines a relativistic adversarial loss with a contrastive discriminator objective to improve audio realism and prompt adherence during accelerated inference.
  • ARC enables generation of approximately 12 seconds of 44.1kHz stereo audio in just 75ms on an H100 GPU and under 7 seconds on mobile edge devices, outperforming prior methods in speed and diversity while being the first fully adversarial, non-distillation approach for audio flow models.

Introduction

Text-to-audio generation has made significant strides in quality, but inference latency remains a major bottleneck—current models often require seconds to minutes per generation, limiting real-time creative applications. Prior acceleration methods rely heavily on distillation, which demands substantial computational resources for training and storage, and often inherits the drawbacks of Classifier-Free Guidance, such as reduced diversity and over-saturation. Some post-training approaches avoid distillation by using adversarial losses, but these have seen limited success in audio due to weak prompt adherence and lack of effective training recipes. The authors introduce Adversarial Relativistic-Contrastive (ARC) post-training, a novel framework that extends relativistic adversarial training to text-conditioned audio and introduces a contrastive discriminator objective to enforce prompt fidelity. This approach enables fast, high-quality audio generation without distillation or CFG, achieving 12 seconds of 44.1kHz stereo audio in just 75ms on an H100 GPU—100x faster than the original model—and enabling on-device inference in ~7 seconds on mobile CPUs, making it the fastest text-to-audio system to date.

Method

The authors use a two-stage framework for text-to-audio generation: a pre-trained rectified flow model, followed by adversarial post-training that accelerates sampling while preserving quality and prompt adherence. The overall architecture is a latent diffusion model operating in a compressed audio space. The base model consists of the 156M-parameter autoencoder from SAO, which compresses stereo waveforms into a 64-channel latent representation at a 21.5 Hz latent rate; generation is conditioned on text prompts encoded by a 109M-parameter T5 text encoder. The core generative component is a Diffusion Transformer (DiT), initially pre-trained as a rectified flow model to predict the flow velocity $v_\theta(\mathbf{x}_t, t, \mathbf{c})$, which is used to reverse the forward noising process defined by $\mathbf{x}_t = (1 - t)\mathbf{x}_0 + t\,\boldsymbol{\epsilon}$.
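
Under this linear noising process, a clean-signal estimate can be recovered algebraically from the predicted velocity, which is what the post-training stage below exploits. A minimal sketch, assuming the standard rectified-flow velocity target $\boldsymbol{\epsilon} - \mathbf{x}_0$ (here `v_model` is a hypothetical stand-in for the pre-trained DiT):

```python
import torch

def forward_noise(x0: torch.Tensor, t: torch.Tensor):
    """Forward noising process: x_t = (1 - t) * x0 + t * eps."""
    eps = torch.randn_like(x0)
    return (1 - t) * x0 + t * eps, eps

def x0_from_velocity(v_model, x_t: torch.Tensor, t: torch.Tensor, cond):
    """With velocity v = eps - x0, algebra gives x0 = x_t - t * v."""
    v = v_model(x_t, t, cond)  # predicted velocity v_theta(x_t, t, c)
    return x_t - t * v
```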

In the post-training stage, the pre-trained velocity predictor $v_\theta$ is transformed into a few-step generator $G_\phi$ by reparameterizing the model to directly output clean audio $\hat{\mathbf{x}}_0$ from a noisy input $\mathbf{x}_t$. This is achieved through adversarial post-training, where the generator $G_\phi$ and a discriminator $D_\psi$ are jointly optimized. The discriminator is initialized from the pre-trained DiT, using its input embedding layers and 75% of its blocks, and is augmented with a lightweight 1D convolutional head. During training, real audio $\mathbf{x}_0$ is corrupted with noise to obtain $\mathbf{x}_t$, which is passed to the generator to produce a denoised sample $\hat{\mathbf{x}}_0$. Both the real and generated samples are then re-noised to a lower noise level $s$ to form the discriminator's inputs. The two networks are trained adversarially on a relativistic loss $\mathcal{L}_{\mathrm{R}}(\phi, \psi) = \mathbb{E}[f(\Delta_{\mathrm{gen}} - \Delta_{\mathrm{real}})]$ with $f(x) = -\log(1 + e^{-x})$, where $\Delta_{\mathrm{gen}}$ is the discriminator logit for the generated sample and $\Delta_{\mathrm{real}}$ the logit for the paired real sample: the generator maximizes this objective, pushing its outputs to score as "more real" than their real counterparts, while the discriminator minimizes it.
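
The following is a minimal sketch of one adversarial step under this formulation. The names `G` and `D` and the shared re-noising noise are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def arc_relativistic_step(G, D, x0, cond, t, s):
    """One adversarial step: corrupt, denoise, re-noise, compare paired logits."""
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * eps                  # corrupt real audio to level t
    x0_hat = G(x_t, t, cond)                      # generator's clean estimate

    eps_s = torch.randn_like(x0)                  # re-noise both to lower level s
    d_real = D((1 - s) * x0 + s * eps_s, s, cond)
    d_gen = D((1 - s) * x0_hat + s * eps_s, s, cond)

    # L_R = E[f(d_gen - d_real)], with f(x) = -log(1 + e^{-x}) = logsigmoid(x)
    l_r = F.logsigmoid(d_gen - d_real).mean()
    g_loss = -l_r   # generator maximizes L_R: wants d_gen > d_real
    d_loss = l_r    # discriminator minimizes L_R (detach x0_hat in its update)
    return g_loss, d_loss
```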

To address the poor prompt adherence that can arise from a purely realism-focused adversarial loss, the authors introduce a contrastive loss $\mathcal{L}_{\mathrm{C}}$ for the discriminator, designed to improve its ability to judge the alignment between audio and text prompts. It is applied by shuffling the text prompts within a batch to create mismatched audio-text pairs and training the discriminator to maximize the gap between the logits for correct pairs and incorrect pairs. The loss is formulated as $\mathcal{L}_{\mathrm{C}}(\psi) = \mathbb{E}[f(\Delta_{\mathrm{real}}(\mathbf{x}_0, s, \mathcal{P}[\mathbf{c}]) - \Delta_{\mathrm{real}}(\mathbf{x}_0, s, \mathbf{c}))]$, where $\mathcal{P}[\cdot]$ denotes a random permutation of the prompts. This contrastive objective encourages the discriminator to focus on semantic features rather than spurious correlations, thereby providing a stronger gradient signal for the generator to improve prompt adherence. The total post-training objective is the sum of the relativistic adversarial loss and the contrastive loss: $\mathcal{L}_{\mathrm{ARC}}(\phi, \psi) = \mathcal{L}_{\mathrm{R}}(\phi, \psi) + \lambda \, \mathcal{L}_{\mathrm{C}}(\psi)$.
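
Continuing the sketch above, a minimal version of the contrastive term and the combined discriminator objective (`lam` stands in for $\lambda$; the in-batch prompt permutation is the mechanism described in the text):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(D, x_s_real, s, cond):
    """L_C = E[f(d(P[c]) - d(c))] with f = logsigmoid; minimizing this
    pushes matched-prompt logits above shuffled-prompt logits."""
    perm = torch.randperm(cond.shape[0], device=cond.device)
    d_matched = D(x_s_real, s, cond)
    d_shuffled = D(x_s_real, s, cond[perm])   # mismatched audio-text pairs
    return F.logsigmoid(d_shuffled - d_matched).mean()

# The discriminator minimizes d_loss + lam * contrastive_loss(...),
# while the generator's objective (maximize L_R) is unchanged.
```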

After post-training, the model is used for inference with a specialized sampling strategy. Because the generator $G_\phi$ directly estimates clean outputs from noisy inputs, the traditional ODE solvers used by rectified flows no longer apply. Instead, the authors employ ping-pong sampling, which alternates between denoising a sample with $G_\phi$ and re-noising it to a lower noise level. This iterative refinement produces high-fidelity audio in a small number of steps, significantly accelerating generation.
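
A minimal sketch of ping-pong sampling under the linear noising process above; the linear step grid and the default of 8 steps (the paper's best-performing setting) are illustrative choices:

```python
import torch

@torch.no_grad()
def ping_pong_sample(G, cond, shape, steps=8, device="cuda"):
    """Alternate denoising (predict clean audio) with re-noising to lower levels."""
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    x = torch.randn(shape, device=device)                # pure noise at t = 1
    for i in range(steps):
        x0_hat = G(x, ts[i], cond)                       # denoise: clean estimate
        s = ts[i + 1]                                    # next, lower noise level
        x = (1 - s) * x0_hat + s * torch.randn_like(x)   # re-noise ("ping-pong")
    return x                                             # final s = 0 leaves x0_hat
```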

Experiment

  • Evaluated accelerated text-to-audio models on the AudioCaps test set (4,875 generations), measuring audio quality, semantic alignment, prompt adherence, and diversity with $FD_{openl3}$, $KL_{passt}$, CLAP score, $R_{passt}$, $C_{passt}$, and the proposed CCDS metric.
  • Achieved 100x speedup over SAO (100-step) and 10x over pre-trained RF (50-step) with competitive metrics, while ARC post-training improved diversity without significant quality loss.
  • ARC outperformed Presto in diversity (higher CCDS and MOS diversity) despite slightly lower quality, with Presto showing high quality but severe diversity reduction.
  • 8-step inference achieved the best results, aligning with recent findings on step efficiency in accelerated models; ablations confirmed the importance of the relativistic loss and of joint $\mathcal{L}_R$ and $\mathcal{L}_C$ training.
  • CCDS metric showed strong correlation with subjective diversity MOS, validating its effectiveness for conditional diversity assessment.
  • On an edge device (Vivo X200 Pro), dynamic Int8 quantization reduced inference time from 15.3 s to 6.6 s and peak memory from 6.5 GB to 3.6 GB (see the quantization sketch after this list); consumer GPUs reach sub-200 ms latency.
  • Demonstrated practical creative applications including real-time sound design, style transfer, voice-to-audio control, and beat-aligned generation, highlighting responsiveness and versatility.
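
As referenced in the list above, dynamic Int8 quantization can be sketched with stock PyTorch; this is a generic illustration of the technique, not the paper's exact mobile deployment pipeline (which would typically go through a mobile runtime):

```python
import torch

def quantize_for_edge(model: torch.nn.Module) -> torch.nn.Module:
    """Rewrite nn.Linear layers to int8 weights, with activations quantized
    dynamically at runtime, trading a little accuracy for speed and memory."""
    return torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```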

The authors use a rectified flow model trained with ARC post-training to achieve significant speed improvements while maintaining high audio quality and prompt adherence. Results show that their method achieves competitive performance with state-of-the-art models like SAO and Presto, but with substantially lower latency and higher diversity, particularly when using 8 steps, while also demonstrating effective edge-device optimization through quantization.

