TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1 kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.
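
To make the idea of CLAP-ranked preference pairs concrete, the following is a minimal sketch of how one pair might be built for a single prompt. It is an illustration under stated assumptions, not the released implementation: the abstract does not specify the selection rule, so this sketch assumes the highest-scoring candidate is taken as "chosen" and the lowest-scoring as "rejected". The function names `cosine_similarity` and `build_preference_pair` are hypothetical, and the plain-list embeddings stand in for the outputs of a CLAP text encoder and audio encoder.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_preference_pair(
    text_embedding: Sequence[float],
    candidate_embeddings: List[Sequence[float]],
) -> Dict[str, object]:
    """Rank candidate audio clips against the prompt and pick a (chosen, rejected) pair.

    Assumption: the embeddings are placeholders for CLAP text and audio
    embeddings, and the best/worst selection rule is an illustrative choice.
    """
    if len(candidate_embeddings) < 2:
        raise ValueError("need at least two candidates to form a preference pair")

    # Score every candidate by its similarity to the prompt embedding.
    scores = [cosine_similarity(text_embedding, c) for c in candidate_embeddings]

    # Indices of the highest- and lowest-scoring candidates.
    chosen = max(range(len(scores)), key=scores.__getitem__)
    rejected = min(range(len(scores)), key=scores.__getitem__)
    return {"chosen": chosen, "rejected": rejected, "scores": scores}
```

In an iterative setup such as the one the abstract describes, a function like this would be called once per prompt on freshly generated candidates, and the resulting pairs would feed the next round of preference optimization.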