
UniAudio: An Audio Foundation Model Toward Universal Audio Generation

Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Zhou Zhao, Xixin Wu, Helen Meng
Abstract

Large language models (LLMs) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio (including speech, sounds, music, and singing) under given input conditions. UniAudio 1) first tokenizes all types of target audio along with other condition modalities, 2) concatenates each source-target pair as a single sequence, and 3) performs next-token prediction using an LLM. In addition, a multi-scale Transformer model is proposed to handle the overly long sequences caused by the residual-vector-quantization-based neural codec used in tokenization. Training of UniAudio is scaled up to 165K hours of audio and 1B parameters across all generative tasks, aiming to obtain sufficient prior knowledge not only of the intrinsic properties of audio but also of the inter-relationships between audio and other modalities. The trained UniAudio model therefore has the potential to become a foundation model for universal audio generation: it shows strong capability on all trained tasks and can seamlessly support new audio generation tasks after simple fine-tuning. Experiments demonstrate that UniAudio achieves state-of-the-art or at least competitive results on most of the 11 tasks. Demo and code are released at https://github.com/yangdongchao/UniAudio
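The tokenize-concatenate-predict pipeline from the abstract can be sketched in a few lines. The sketch below is an illustrative assumption, not the paper's actual implementation: the special token values and the `build_training_sequence` helper are hypothetical, and it only shows how flattening residual-vector-quantization (RVQ) codes with `n_q` codebooks multiplies the sequence length by `n_q`, which is the problem the multi-scale Transformer is designed to mitigate.

```python
# Illustrative sketch (assumed token layout, not UniAudio's real format):
# a source-target pair is flattened into one token sequence for
# next-token prediction by an LLM-style decoder.

def build_training_sequence(condition_tokens, audio_codes, bos=0, task=1, eos=2):
    """Concatenate a source-target pair into a single flat token sequence.

    condition_tokens: tokens for the input condition (e.g. text or phonemes).
    audio_codes: RVQ codec output, a list of frames; each frame holds
                 n_q codebook indices (one per residual quantizer level).
    """
    # Flattening RVQ codes multiplies length by n_q: T frames -> T * n_q tokens,
    # which is why the sequences become overly long.
    flat_audio = [code for frame in audio_codes for code in frame]
    return [bos, task] + condition_tokens + flat_audio + [eos]

# Example: 3 condition tokens, 4 audio frames with n_q = 3 codebooks each.
cond = [10, 11, 12]
codes = [[100, 101, 102], [103, 104, 105], [106, 107, 108], [109, 110, 111]]
seq = build_training_sequence(cond, codes)
print(len(seq))  # 2 special + 3 condition + 4*3 audio + 1 eos = 18
```

A model trained with next-token prediction on such sequences sees the condition as a prefix and learns to emit the audio tokens autoregressively; the multi-scale Transformer then handles the inter-frame and intra-frame (codebook) dimensions at different scales instead of attending over the full flattened length.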
