Fast Timing-Conditioned Latent Audio Diffusion

Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding. Further, most prior work does not address the fact that music and sound effects naturally vary in duration. Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz from text prompts with a generative model. Stable Audio is based on latent diffusion, with its latent defined by a fully-convolutional variational autoencoder. It is conditioned on text prompts as well as timing embeddings, allowing fine control over both the content and the length of the generated music and sounds. Stable Audio is capable of rendering stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU. Despite its compute efficiency and fast inference, it is among the best performers on two public text-to-music and text-to-audio benchmarks and, unlike state-of-the-art models, can generate music with structure and stereo sounds.
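To make the timing-conditioning idea concrete, here is a minimal sketch of how start time and total duration can be embedded and concatenated with text-prompt features to form the cross-attention conditioning. The class and parameter names (TimingEmbedder, max_seconds, dim) are hypothetical illustrations, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class TimingEmbedder(nn.Module):
    """Maps (seconds_start, seconds_total) to conditioning tokens that are
    concatenated with text-prompt features (a sketch under assumed names)."""
    def __init__(self, max_seconds: int = 512, dim: int = 768):
        super().__init__()
        # Learned per-second embeddings (assumed parameterization).
        self.start_emb = nn.Embedding(max_seconds, dim)
        self.total_emb = nn.Embedding(max_seconds, dim)

    def forward(self, seconds_start: torch.Tensor,
                seconds_total: torch.Tensor) -> torch.Tensor:
        # Each timing value contributes one conditioning token.
        return torch.stack(
            [self.start_emb(seconds_start), self.total_emb(seconds_total)],
            dim=1,
        )

# Usage: append timing tokens to the text-encoder output so the diffusion
# model can attend to both content and length information.
embedder = TimingEmbedder()
text_feats = torch.randn(2, 77, 768)            # placeholder text features
timing = embedder(torch.tensor([0, 10]),        # chunk start (sec)
                  torch.tensor([95, 30]))       # total length (sec)
cond = torch.cat([text_feats, timing], dim=1)   # cross-attention context
print(cond.shape)                               # torch.Size([2, 79, 768])
```

At inference, setting the total-duration input to the desired length is what lets the model produce variable-length output within its training window.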