xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement

While attention-based architectures such as Conformers excel in speech enhancement, they face challenges such as scalability with respect to input sequence length. In contrast, the recently proposed Extended Long Short-Term Memory (xLSTM) architecture offers linear scalability. However, xLSTM-based models remain unexplored for speech enhancement. This paper introduces xLSTM-SENet, the first xLSTM-based single-channel speech enhancement system. A comparative analysis reveals that xLSTM, and notably even LSTM, can match or outperform state-of-the-art Mamba- and Conformer-based systems across various model sizes in speech enhancement on the VoiceBank+DEMAND dataset. Through ablation studies, we identify key architectural design choices, such as exponential gating and bidirectionality, that contribute to its effectiveness. Our best xLSTM-based model, xLSTM-SENet2, outperforms state-of-the-art Mamba- and Conformer-based systems of similar complexity on the VoiceBank+DEMAND dataset.
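The exponential gating highlighted above can be sketched as a single sLSTM-style recurrence step with the log-space stabilizer used in the xLSTM literature. This is a minimal illustrative sketch, not the paper's implementation; the parameter shapes, names, and initialization below are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slstm_step(x, h_prev, c_prev, n_prev, m_prev, W, R, b):
    """One sLSTM-style recurrence step with exponential input/forget
    gating, stabilized in log space to avoid overflow.
    (Hypothetical shapes: W is (4*hidden, input), R is (4*hidden, hidden).)
    """
    # Pre-activations for input, forget, cell, and output gates.
    z = W @ x + R @ h_prev + b
    i_pre, f_pre, z_pre, o_pre = np.split(z, 4)

    # Exponential gates: i_t = exp(i_pre), f_t = exp(f_pre).
    # The stabilizer m keeps the exponentials bounded.
    log_i = i_pre
    log_f = f_pre
    m = np.maximum(log_f + m_prev, log_i)
    i = np.exp(log_i - m)
    f = np.exp(log_f + m_prev - m)

    # Cell state, normalizer state, and normalized hidden output.
    c = f * c_prev + i * np.tanh(z_pre)
    n = f * n_prev + i
    h = sigmoid(o_pre) * (c / n)
    return h, c, n, m
```

Because the gates are exponentials rather than sigmoids, the forget gate can exceed 1, which is why the running stabilizer `m` is carried alongside the cell and normalizer states.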