Simple and Controllable Music Generation

We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or via upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual descriptions or melodic features, allowing better control over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing that the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft
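The token interleaving idea above can be illustrated with a minimal sketch. This is a hypothetical example, not the paper's implementation: it shows a "delay"-style pattern in which codebook stream k is shifted right by k steps, so that a single-stage LM can emit one token per codebook at each step instead of requiring a cascade of models. The function name, padding token, and plain-list representation are all assumptions for illustration.

```python
def delay_interleave(streams, pad=-1):
    """Shift codebook stream k right by k steps (a delay-style interleaving sketch).

    streams: list of K equal-length token lists, one per codebook.
    Returns K lists of length T + K - 1, padded with `pad` where no token falls.
    """
    K = len(streams)
    T = len(streams[0])
    out = [[pad] * (T + K - 1) for _ in range(K)]
    for k, stream in enumerate(streams):
        for t, tok in enumerate(stream):
            # Stream k's token for time t is emitted at LM step t + k.
            out[k][t + k] = tok
    return out

# Two codebook streams of three tokens each:
print(delay_interleave([[1, 2, 3], [4, 5, 6]]))
# → [[1, 2, 3, -1], [-1, 4, 5, 6]]
```

At decoding step t, the model predicts stream 0's token for time t and stream k's token for time t - k, which keeps all codebooks in a single autoregressive pass.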