
Abstract

Generative AI in the speech domain, which enables voice cloning and real-time voice conversion from one individual to another, has growing implications. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation; there is therefore an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion. To address these emerging issues, the DEEP-VOICE dataset is generated in this study, comprising real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion. With the task framed as a binary classification problem of whether the speech is real or AI-generated, statistical analysis of temporal audio features through t-testing reveals that their distributions differ significantly between the two classes. Hyperparameter optimisation is implemented for machine learning models to identify the source of speech. Following the training of 208 individual machine learning models over 10-fold cross validation, it is found that the Extreme Gradient Boosting model can achieve an average classification accuracy of 99.3% and can classify speech in real-time, at around 0.004 milliseconds given one second of speech. All data generated for this study is released publicly for future research on AI speech detection.
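
The t-testing step mentioned above can be illustrated with a minimal sketch. It assumes Welch's unequal-variance form of the test, which the abstract does not specify, and the function name `welch_t` and the feature values are hypothetical, chosen only for illustration; they are not taken from the DEEP-VOICE dataset.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom.

    Assumption: the unequal-variance (Welch) form is used here purely
    for illustration; the study may have used a different t-test variant.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    # Sample variance (n - 1 denominator) divided by sample size gives
    # the squared standard error of each group mean.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical per-clip values of one temporal audio feature.
real_feature = [0.12, 0.15, 0.11, 0.14, 0.13]
fake_feature = [0.21, 0.19, 0.22, 0.20, 0.23]

t_stat, dof = welch_t(real_feature, fake_feature)
print(f"t = {t_stat:.3f}, dof = {dof:.2f}")
```

A large absolute t statistic for a feature suggests that its mean differs between real and AI-generated speech, which is the kind of evidence the abstract refers to when it reports significantly different distributions.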
