Real-Time Target Sound Extraction

We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner while also leveraging the generalization performance of transformer-based architectures. Our evaluations show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2-4x smaller model size and a 1.5-2x lower runtime. We provide code, dataset, and audio samples: https://waveformer.cs.washington.edu/.
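
To illustrate why a stack of dilated causal convolutions covers a large receptive field cheaply, the following minimal sketch computes the receptive field of such a stack and checks causality, the property that makes streaming possible. It is not the Waveformer implementation: the kernel size, the doubling dilation schedule, the layer count, and the function names are assumptions chosen for illustration, and a single shared filter stands in for learned per-layer weights.

```python
def receptive_field(kernel_size, dilations):
    """Number of input samples that a single output sample depends on."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: y[t] = sum_j weights[j] * x[t - j * dilation].

    Samples before the start of the signal are treated as zeros (left
    padding), so y[t] never depends on x[t + 1], x[t + 2], ...
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def dilated_stack(x, weights, dilations):
    """Apply one causal dilated convolution per entry in `dilations`."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


# Hypothetical configuration (not taken from the paper):
# kernel size 3, dilation doubling at every layer, 10 layers.
KERNEL_SIZE = 3
DILATIONS = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512

rf = receptive_field(KERNEL_SIZE, DILATIONS)
print(rf)  # 2047

# Causality check on a short signal: changing future samples must not
# change any output computed at or before the current time step.
weights = [0.5, 0.3, 0.2]
signal = [float(i % 7) for i in range(40)]
modified = signal[:20] + [99.0] * 20  # differs only from t = 20 onward

out_a = dilated_stack(signal, weights, [1, 2, 4])
out_b = dilated_stack(modified, weights, [1, 2, 4])
print(out_a[:20] == out_b[:20])  # True
```

In this sketch, 10 layers of 3 taps each (30 multiply-adds per output sample and channel) reach back 2047 samples, whereas a single undilated convolution would need 2047 taps for the same context. The receptive field grows exponentially with depth while the cost grows only linearly.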