Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variation due to differences in task focus, language, granularity of annotation, and text structure. To overcome this one-to-many interference, we carefully design a multi-task training framework that conditions the decoder on a sequence of hierarchical tags, encouraging knowledge sharing through shared tags and avoiding interference through specialized tags. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which accepts diverse audio and text inputs, enabling multi-turn dialogues and supporting various audio-centric scenarios.
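
To make the hierarchical-tag conditioning concrete, the following is a minimal sketch of how a coarse-to-fine tag prefix for the decoder might be assembled per training example. The tag names, function name, and ordering here are illustrative assumptions, not the model's actual special-token vocabulary; the point is only that tasks sharing a tag can share knowledge, while dataset-specific tags keep differing label formats from interfering.

```python
# Illustrative sketch (hypothetical tag names, not the model's real vocabulary):
# build a decoder prefix from hierarchical tags, ordered from shared to specific.

def build_decoder_prefix(audio_type: str, task: str, language: str | None = None,
                         use_timestamps: bool = False) -> list[str]:
    """Assemble a coarse-to-fine tag sequence for one training example."""
    tags = ["<|startofanalysis|>"]          # shared start tag across all tasks
    tags.append(f"<|{audio_type}|>")        # e.g. speech / sound / music / song
    tags.append(f"<|{task}|>")              # e.g. transcribe / caption / qa
    if language is not None:
        tags.append(f"<|{language}|>")      # output language for text tasks
    # timestamp tag distinguishes word/segment-level outputs from plain text
    tags.append("<|timestamps|>" if use_timestamps else "<|notimestamps|>")
    return tags


# Example: an English ASR sample and an audio-captioning sample share the
# start tag (knowledge sharing) but differ in audio-type/task tags, so their
# distinct label formats do not interfere with each other.
print(build_decoder_prefix("speech", "transcribe", language="en", use_timestamps=True))
print(build_decoder_prefix("sound", "caption", language="en"))
```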