QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

Chang Li* Ruoyu Wang* Lijuan Liu Jun Du† Yixuan Sun Zilu Guo Zhengrong Zhang Yuan Jiang Jianqing Gao Feng Ma

Abstract

Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Most open-source datasets suffer from issues such as low-quality waveforms and low text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties in the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, showcasing its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage caption refinement approach to address the issue of low-quality captions. Experiments show state-of-the-art (SOTA) performance on benchmark datasets including MusicCaps and the Song-Describer Dataset with both objective and subjective metrics. Demo audio samples are available at https://qa-mdt.github.io/; code and pretrained checkpoints are open-sourced at https://github.com/ivcylc/OpenMusic.

