Improving Text-To-Audio Models with Synthetic Captions
Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro
Abstract
It is an open challenge to obtain high-quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find that leveraging our pipeline and synthetic captions leads to significant improvements in audio generation quality, achieving a new state-of-the-art.