ETTA: Elucidating the Design Space of Text-to-Audio Models

Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. To provide a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high-quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model, dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions -- a task that is more challenging than current benchmarks.