FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion

Shunian Chen, Xinyuan Xie, Zheshu Chen, Liyan Zhao, Owen Lee, Zhan Su, Qilin Sun, Benyou Wang
Published: 6/9/2025
Abstract

High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, combined with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data can be found at https://github.com/satsuki2486441738/FusionAudio.
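
The two-stage design described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the extractor functions (`run_asr`, `run_music_tagger`, `run_sound_event_model`, `run_video_captioner`) and the `llm` callable are hypothetical stand-ins, since the abstract does not name the actual pretrained models or prompts.

```python
# Minimal sketch of the two-stage captioning pipeline from the abstract.
# Every model below is a stub; real pretrained extractors and an LLM
# would replace these placeholders.

from dataclasses import dataclass


# --- Stage 1: specialized extractors (stubs for pretrained models) ---

def run_asr(audio_path: str) -> str:
    return "two people discussing the weather"       # placeholder transcript

def run_music_tagger(audio_path: str) -> str:
    return "soft acoustic guitar in the background"  # placeholder tags

def run_sound_event_model(audio_path: str) -> str:
    return "birdsong, distant traffic"               # placeholder events

def run_video_captioner(video_path: str) -> str:
    return "a park bench on a sunny afternoon"       # placeholder caption


@dataclass
class ContextualCues:
    """Diverse contextual cues gathered in stage 1."""
    speech: str
    music: str
    sounds: str
    visual: str


def extract_cues(audio_path: str, video_path: str) -> ContextualCues:
    """Stage 1: run the specialized models on the audio and paired video."""
    return ContextualCues(
        speech=run_asr(audio_path),
        music=run_music_tagger(audio_path),
        sounds=run_sound_event_model(audio_path),
        visual=run_video_captioner(video_path),
    )


# --- Stage 2: LLM fusion of the cues into one fine-grained caption ---

def synthesize_caption(cues: ContextualCues, llm) -> str:
    """Stage 2: prompt an LLM (any text-generation callable) with the cues."""
    prompt = (
        "Write one detailed, context-aware caption for an audio clip.\n"
        f"Speech: {cues.speech}\n"
        f"Music: {cues.music}\n"
        f"Sound events: {cues.sounds}\n"
        f"Visual context: {cues.visual}\n"
    )
    return llm(prompt)


if __name__ == "__main__":
    cues = extract_cues("clip.wav", "clip.mp4")
    # Echo the assembled prompt instead of calling a real LLM.
    print(synthesize_caption(cues, llm=lambda p: p))
```

Separating extraction from synthesis is what makes the pipeline scalable: each cue extractor runs independently per clip, and only the lightweight text fusion step involves the LLM.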