End-to-End Environmental Sound Classification using a 1D Convolutional Neural Network

In this paper, we present an end-to-end approach for environmental sound classification based on a 1D Convolutional Neural Network (CNN) that learns a representation directly from the audio signal. Several convolutional layers are used to capture the signal's fine time structure and learn diverse filters that are relevant to the classification task. The proposed approach can deal with audio signals of any length, as it splits the signal into overlapped frames using a sliding window. Different architectures considering several input sizes are evaluated, including the initialization of the first convolutional layer with a Gammatone filterbank that models the human auditory filter response in the cochlea. The performance of the proposed end-to-end approach in classifying environmental sounds was assessed on the UrbanSound8k dataset, and the experimental results show that it achieves a mean accuracy of 89%. The proposed approach therefore outperforms most of the state-of-the-art approaches that use handcrafted features or 2D representations as input. Furthermore, the proposed approach has a small number of parameters compared to other architectures found in the literature, which reduces the amount of data required for training.
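To make the overall pipeline concrete, the sketch below illustrates the two ingredients described above: slicing an arbitrary-length waveform into overlapped frames with a sliding window, and a stack of 1D convolutions applied directly to the raw audio of each frame. It is a minimal illustration in PyTorch; the framework, the number of layers, the filter counts, kernel sizes, frame length, and hop size are all assumptions for the example and do not reproduce the paper's exact configuration (including the optional Gammatone initialization of the first layer).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Raw1DCNN(nn.Module):
    """Illustrative 1D CNN over raw audio frames.
    Layer sizes are assumptions, not the paper's exact architecture."""
    def __init__(self, n_classes=10):
        super().__init__()
        # Stacked 1D convolutions capture the fine time structure of the waveform.
        # The first layer could alternatively be initialized with a Gammatone filterbank.
        self.conv1 = nn.Conv1d(1, 16, kernel_size=64, stride=2)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=32, stride=2)
        self.conv3 = nn.Conv1d(32, 64, kernel_size=16, stride=2)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                      # x: (batch, 1, frame_len)
        x = F.max_pool1d(F.relu(self.conv1(x)), 4)
        x = F.max_pool1d(F.relu(self.conv2(x)), 4)
        x = F.relu(self.conv3(x))
        x = self.pool(x).squeeze(-1)           # (batch, 64)
        return self.fc(x)                      # class logits per frame

def split_into_frames(signal, frame_len=16000, hop=8000):
    """Slide a window over an arbitrary-length 1D signal, yielding overlapped frames.
    Signals shorter than one frame are zero-padded."""
    if len(signal) < frame_len:
        signal = F.pad(signal, (0, frame_len - len(signal)))
    starts = range(0, len(signal) - frame_len + 1, hop)
    return torch.stack([signal[s:s + frame_len] for s in starts])

# Example: classify a clip of arbitrary length by framing it first.
clip = torch.randn(44100)                      # placeholder waveform
frames = split_into_frames(clip).unsqueeze(1)  # (n_frames, 1, frame_len)
logits = Raw1DCNN()(frames)                    # one prediction per frame
```

A clip-level decision can then be obtained by aggregating the frame-level predictions, for example by averaging the logits over all frames; the exact aggregation rule used in the paper may differ.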