QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Most open-source datasets suffer from issues such as low-quality waveforms and weak text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties of the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, demonstrating its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage caption refinement approach to address the issue of low-quality captions. Experiments show state-of-the-art (SOTA) performance on benchmark datasets, including MusicCaps and the Song-Describer Dataset, under both objective and subjective metrics. Demo audio samples are available at https://qa-mdt.github.io/; code and pretrained checkpoints are open-sourced at https://github.com/ivcylc/OpenMusic.
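To make the quality-aware training idea concrete, the following is a minimal PyTorch sketch of one plausible realization: a per-clip quality estimate (e.g., from a pseudo-MOS predictor) is quantized into discrete levels and injected as a learned embedding prepended to the text conditioning, so that the highest level can be requested at inference time. The class name, the number of levels, and the prepend-token design are illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn as nn

    class QualityAwareConditioner(nn.Module):
        """Sketch: map a scalar quality score to a learned embedding
        that is prepended to the text-conditioning token sequence."""
        def __init__(self, num_levels: int = 5, dim: int = 768):
            super().__init__()
            self.num_levels = num_levels
            self.quality_emb = nn.Embedding(num_levels, dim)

        def forward(self, text_cond: torch.Tensor, quality: torch.Tensor) -> torch.Tensor:
            # quality: per-example scores in [0, 1] (assumed pseudo-MOS, rescaled)
            level = (quality.clamp(0, 1) * (self.num_levels - 1)).round().long()
            q_tok = self.quality_emb(level).unsqueeze(1)   # (B, 1, dim) quality token
            return torch.cat([q_tok, text_cond], dim=1)    # prepend to text features

    # Training: condition each example on its estimated quality level.
    # Inference: pass quality = 1.0 to steer generation toward high quality.
    cond = QualityAwareConditioner()
    text_cond = torch.randn(2, 16, 768)                    # (B, T, dim) text features
    out = cond(text_cond, torch.tensor([0.3, 0.9]))        # -> shape (2, 17, 768)

The design point this sketch illustrates is that quality becomes an explicit, controllable conditioning signal, which lets a model learn from quality-imbalanced data without being dragged toward the low-quality mode.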