8 months ago

Abstract

In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, out-performing other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Further, it sets new baselines in the zero-shot ESC-task on the same datasets (68.78% and 69.40%, respectively). Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
