MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection

Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are then jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. MAT-SED surpasses state-of-the-art performance on DCASE2023 task 4, achieving PSDS1/PSDS2 scores of 0.587/0.896 respectively.
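As a rough illustration of the masked-reconstruction pre-training described above, the sketch below corrupts random time frames of encoder features with a learned mask token and trains a Transformer context network to reconstruct the hidden frames. All dimensions, the mask ratio, and the use of PyTorch's stock `TransformerEncoder` (which lacks the relative positional encoding used in MAT-SED) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class MaskedReconstructionPretrainer(nn.Module):
    """Sketch of masked-reconstruction pre-training for a Transformer
    context network operating on features from a pre-trained encoder.
    Sizes and mask ratio are illustrative assumptions."""

    def __init__(self, dim=768, n_layers=3, n_heads=8, mask_ratio=0.75):
        super().__init__()
        # Stock encoder layers stand in for the relative-positional-encoding
        # Transformer described in the abstract.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        self.context_net = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mask_token = nn.Parameter(torch.zeros(dim))  # learned mask embedding
        self.mask_ratio = mask_ratio

    def forward(self, feats):
        # feats: (batch, time, dim) frame-level encoder features
        b, t, d = feats.shape
        # Choose which frames to hide, then replace them with the mask token.
        mask = torch.rand(b, t, device=feats.device) < self.mask_ratio
        corrupted = torch.where(
            mask.unsqueeze(-1), self.mask_token.expand(b, t, d), feats)
        recon = self.context_net(corrupted)
        # Self-supervised objective: MSE on the masked frames only.
        loss = ((recon - feats) ** 2)[mask].mean()
        return loss


# Hypothetical usage: pre-train the context network on unlabeled target data,
# keeping the encoder frozen; fine-tuning would then update both jointly.
model = MaskedReconstructionPretrainer()
encoder_features = torch.randn(4, 250, 768)  # placeholder encoder output
loss = model(encoder_features)
loss.backward()
```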