
Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models

Florian Schmid, Khaled Koutini, Gerhard Widmer
Abstract

The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers are excellent at exploiting large datasets, creating powerful pre-trained models that surpass CNNs when fine-tuned on downstream tasks. However, current popular Audio Spectrogram Transformers are demanding in terms of computational complexity compared to CNNs. Recently, we have shown that, by employing Transformer-to-CNN Knowledge Distillation, efficient CNNs can catch up with and even outperform Transformers on large datasets. In this work, we extend this line of research and increase the capacity of efficient CNNs by introducing dynamic CNN blocks, constructed of dynamic non-linearities, dynamic convolutions and attention mechanisms. We show that these dynamic CNNs outperform traditional efficient CNNs, in terms of the performance-complexity trade-off and parameter efficiency, at the task of audio tagging on the large-scale AudioSet. Our experiments further indicate that the introduced dynamic CNNs achieve better performance on downstream tasks and scale up well, attaining Transformer performance and even outperforming them on AudioSet and several downstream tasks.
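To illustrate the dynamic-convolution idea the abstract refers to, here is a minimal sketch of one common formulation: K candidate kernels are mixed per input using softmax attention weights derived from a global pooling of the input. All function and parameter names (`dynamic_conv2d`, `attn_w`, `attn_b`) are hypothetical; this is not the authors' implementation, just the general technique under stated assumptions.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D vector
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_conv2d(x, kernels, attn_w, attn_b):
    """Illustrative dynamic convolution (single sample, 'same' padding).

    x:       input feature map, shape (C, H, W)
    kernels: K candidate kernels, shape (K, O, C, k, k)
    attn_w:  hypothetical attention weights, shape (K, C)
    attn_b:  hypothetical attention bias, shape (K,)
    """
    K, O, C, k, _ = kernels.shape
    # Attention: global average pool the input, map to K logits, softmax
    ctx = x.mean(axis=(1, 2))                  # (C,)
    alpha = softmax(attn_w @ ctx + attn_b)     # (K,) per-input mixing weights
    # Aggregate the K kernels into one input-dependent kernel
    w = np.tensordot(alpha, kernels, axes=1)   # (O, C, k, k)
    # Plain 2-D convolution (naive loops, for clarity only)
    H, W = x.shape[1:]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((O, H, W))
    for o in range(O):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(w[o] * xp[:, i:i+k, j:j+k])
    return out
```

The key property is that the convolution kernel itself depends on the input: capacity grows with the number of candidate kernels K, while the per-forward cost stays close to a single convolution, since aggregation happens in weight space before convolving.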