BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification

Respiratory sound classification (RSC) is challenging due to varied acoustic signatures, primarily influenced by patient demographics and recording environments. To address this issue, we introduce a text-audio multimodal model that utilizes the metadata of respiratory sounds, which provides useful complementary information for RSC. Specifically, we fine-tune a pretrained text-audio multimodal model using free-text descriptions derived from the sound samples' metadata, which includes the gender and age of patients, the type of recording device, and the recording location on the patient's body. Our method achieves state-of-the-art performance on the ICBHI dataset, surpassing the previous best result by a notable margin of 1.17%. This result validates the effectiveness of leveraging metadata alongside respiratory sound samples to enhance RSC performance. Additionally, we investigate model performance in the case where metadata is partially unavailable, which may occur in real-world clinical settings.
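The abstract states that free-text descriptions are derived from each sample's metadata fields. A minimal sketch of such a metadata-to-text rendering step is shown below; the field names and sentence template are illustrative assumptions, since the abstract specifies which fields are used but not the exact phrasing.

```python
def metadata_to_text(meta: dict) -> str:
    """Render a metadata record as a free-text description.

    Assumed keys ("gender", "age", "device", "location") and the sentence
    template are hypothetical; the paper only states that gender, age,
    recording device, and body location are used. Missing fields are
    simply omitted, mirroring the partially-unavailable-metadata case.
    """
    parts = []
    if meta.get("gender"):
        parts.append(f"a {meta['gender']} patient")
    if meta.get("age") is not None:
        parts.append(f"aged {meta['age']}")
    desc = "Respiratory sound of " + (" ".join(parts) if parts else "a patient")
    if meta.get("device"):
        desc += f", recorded with a {meta['device']}"
    if meta.get("location"):
        desc += f" at the {meta['location']}"
    return desc + "."


# Example: a fully populated record versus one with no metadata at all.
full = metadata_to_text({
    "gender": "female",
    "age": 63,
    "device": "Meditron stethoscope",
    "location": "left anterior chest",
})
empty = metadata_to_text({})
```

Descriptions produced this way could then be fed to the text encoder of a pretrained text-audio model (e.g. a CLAP-style encoder) alongside the corresponding audio clip.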