Abstract

In this work, we further develop the conformer-based metric generative adversarial network (CMGAN) model for speech enhancement (SE) in the time-frequency (TF) domain. This paper builds on our previous work but takes a more in-depth look by conducting extensive ablation studies on model inputs and architectural design choices. We rigorously tested the generalization ability of the model to unseen noise types and distortions. We have fortified our claims through DNS-MOS measurements and listening tests. Rather than focusing exclusively on the speech denoising task, we extend this work to address the dereverberation and super-resolution tasks. This necessitated exploring various architectural changes, specifically metric discriminator scores and masking techniques. It is essential to highlight that this is among the earliest works that attempted complex TF-domain super-resolution. Our findings show that CMGAN outperforms existing state-of-the-art methods in the three major speech enhancement tasks: denoising, dereverberation, and super-resolution. For example, in the denoising task using the Voice Bank+DEMAND dataset, CMGAN notably exceeded the performance of prior models, attaining a PESQ score of 3.41 and an SSNR of 11.10 dB. Audio samples and CMGAN implementations are available online.