Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier

Generative AI technologies, including text-to-speech (TTS) and voice conversion (VC), frequently produce output that is indistinguishable from genuine samples, making it difficult for listeners to discern real from synthetic content. This indistinguishability undermines trust in media, and the arbitrary cloning of personal voice signals poses significant challenges to privacy and security. In the field of audio deepfake detection, most models that achieve high detection accuracy currently employ self-supervised pre-trained models. However, as deepfake audio generation algorithms continue to evolve, maintaining high discrimination accuracy against new algorithms grows more challenging. To enhance sensitivity to deepfake audio features, we propose a deepfake audio detection model that incorporates an SLS (Sensitive Layer Selection) module. Specifically, the pre-trained XLS-R enables our model to extract diverse audio features from its various layers, each providing distinct discriminative information. Using the SLS classifier, our model captures sensitive contextual information across these layer-level audio features and effectively employs it for fake audio detection. Experimental results show that our method achieves state-of-the-art (SOTA) performance on both the ASVspoof 2021 DF and In-the-Wild datasets, with an Equal Error Rate (EER) of 1.92% on the ASVspoof 2021 DF dataset and 7.46% on the In-the-Wild dataset. Code and data are available at https://github.com/QiShanZhang/SLSforADD.
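To make the layer-selection idea concrete, below is a minimal PyTorch sketch of how per-layer XLS-R features could be extracted and fused for binary real/fake classification. The abstract does not specify the SLS internals, so the learnable softmax layer weights, the `LayerSelectionClassifier` name, the classification head, and the `facebook/wav2vec2-xls-r-300m` checkpoint are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal, illustrative sketch (not the authors' exact SLS module):
# extract hidden states from every XLS-R layer and fuse them with
# learnable softmax weights before a small real/fake classification head.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class LayerSelectionClassifier(nn.Module):  # hypothetical name
    def __init__(self, ckpt="facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(ckpt)
        cfg = self.backbone.config
        n_states = cfg.num_hidden_layers + 1  # transformer layers + CNN output
        # Learnable scores decide how much each layer contributes.
        self.layer_logits = nn.Parameter(torch.zeros(n_states))
        self.head = nn.Sequential(
            nn.Linear(cfg.hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # logits for real vs. fake
        )

    def forward(self, wav):  # wav: (batch, samples), 16 kHz mono
        out = self.backbone(wav, output_hidden_states=True)
        feats = torch.stack(out.hidden_states)           # (n_states, B, T, H)
        w = torch.softmax(self.layer_logits, 0).view(-1, 1, 1, 1)
        fused = (w * feats).sum(0)                       # weighted layer sum
        return self.head(fused.mean(1))                  # pool over time -> (B, 2)


model = LayerSelectionClassifier().eval()
with torch.no_grad():
    logits = model(torch.randn(1, 16000))  # one second of dummy audio
```

Training such weights end to end lets the model emphasize whichever transformer layers carry the most discriminative artifacts for a given spoofing algorithm, which is the intuition behind selecting "sensitive" layers rather than using only the final hidden state.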