Efficient Neural Music Generation

Lam, Max W. Y. ; Tian, Qiao ; Li, Tang ; Yin, Zongyu ; Feng, Siyuan ; Tu, Ming ; Ji, Yuliang ; Xia, Rui ; Ma, Mingbo ; Song, Xuchen ; Chen, Jitong ; Wang, Yuping ; Wang, Yuxuan
Abstract

Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet, sampling with MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6% when sampling 10s or 30s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages of sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.
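The abstract describes the core idea only at a high level: at each denoising step, semantic-token embeddings are injected into segments of audio latents via cross-attention, and the final latents are decoded by a VAE-GAN. The sketch below illustrates that conditioning pattern with toy numpy code; the function names, dimensions, and the simplified update rule are illustrative assumptions, not the paper's actual DPD parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, sem_emb, d=16):
    """Condition noisy latents on semantic-token embeddings via cross-attention.

    latents: (T, d) one segment of audio latents (queries)
    sem_emb: (S, d) semantic-token embeddings (keys/values)
    """
    weights = softmax(latents @ sem_emb.T / np.sqrt(d))  # (T, S) attention map
    return latents + weights @ sem_emb                   # residual injection

def sample_sketch(sem_emb, n_frames=32, d=16, steps=8):
    """Toy denoising loop: every step re-injects the semantic conditioning.

    The 0.8 contraction is a placeholder for a real diffusion update; the
    returned latents would then go to a VAE-GAN decoder to produce waveform.
    """
    x = rng.standard_normal((n_frames, d))  # start from Gaussian noise
    for _ in range(steps):
        x = 0.8 * cross_attend(x, sem_emb, d)
    return x
```

The point of the sketch is the control flow: unlike MusicLM's token-by-token cascade through three LMs, a diffusion decoder needs only a small, fixed number of denoising passes, each of which sees the full semantic conditioning at once, which is where the claimed reduction in forward passes comes from.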
