
CLAPSep: Leveraging Contrastive Pre-trained Model for Multi-Modal Query-Conditioned Target Sound Extraction

Ma, Hao; Peng, Zhiyuan; Li, Xu; Shao, Mingjie; Wu, Xixin; Liu, Ju
Abstract

Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to make the randomly initialized model comprehend sound events and perform separation accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. Specifically, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep not only enhance extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin. Full code and some audio examples are released for reproduction and evaluation.
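The two-component design described above (a query network that turns positive/negative prompts into a conditional embedding, and a separation network conditioned on it) can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' released implementation: the CLAP encoders are replaced by random stand-in embeddings, and both the fusion-by-subtraction of prompts and the FiLM-style conditioning are assumptions made here for illustration.

```python
# Hypothetical sketch of a query-conditioned TSE pipeline (not the CLAPSep code).
# A frozen CLAP-like encoder would normally produce the query embeddings;
# random tensors stand in for them below.
import torch
import torch.nn as nn

class QueryNetwork(nn.Module):
    """Maps prompt embeddings to a single conditional embedding.
    Positive and negative prompts are fused by subtraction here
    purely for illustration."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, pos_emb, neg_emb=None):
        q = pos_emb if neg_emb is None else pos_emb - neg_emb
        return self.proj(q)

class SeparationNetwork(nn.Module):
    """Predicts a time-frequency mask for the target source,
    conditioned on the query embedding via FiLM-style modulation."""
    def __init__(self, n_freq=257, dim=512):
        super().__init__()
        self.enc = nn.Linear(n_freq, dim)
        self.film = nn.Linear(dim, 2 * dim)  # produces scale and shift
        self.dec = nn.Linear(dim, n_freq)

    def forward(self, mix_spec, query):
        # mix_spec: (batch, time, n_freq) magnitude spectrogram
        h = torch.relu(self.enc(mix_spec))
        scale, shift = self.film(query).chunk(2, dim=-1)
        h = h * scale.unsqueeze(1) + shift.unsqueeze(1)
        mask = torch.sigmoid(self.dec(h))
        return mix_spec * mask  # masked estimate of the target sound

# Toy usage with random stand-ins for CLAP prompt embeddings.
pos = torch.randn(2, 512)    # e.g., embedding of "dog barking"
neg = torch.randn(2, 512)    # e.g., embedding of "rain" (sound to suppress)
mix = torch.rand(2, 100, 257)
query = QueryNetwork()(pos, neg)
target = SeparationNetwork()(mix, query)
print(target.shape)          # torch.Size([2, 100, 257])
```

The subtraction of the negative-prompt embedding is one simple way to let the model both attend to wanted events and suppress unwanted ones; the paper's actual fusion and conditioning mechanisms are detailed in the released code.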