HyperAI

Who Wants to Live in a Future Where Your Voice Is Perfectly Imitated by AI?

7 years ago
Information
Gabriel

Don't worry, the technology isn't very convincing yet...
Uh, but I still feel a little uneasy.

AI software can mimic a person's voice, starling-like, after hearing only a few short samples, according to a paper published by researchers at Baidu.

If the technology is perfected, it could be used to generate fake audio clips in which people say things they never actually said.

Does this make you feel a little creepy?

Baidu's AI team is well-known for its work developing realistic speech, and its latest research project, released recently, shows how a model can learn the characteristics of a person's voice and generate content that the person never said.

Admittedly, even the best clips produced by the model were noisy and of lower quality than the original speech. But the "neural cloning system" developed by the researchers managed to retain the speaker's British accent and sounded fairly similar to the real voice.

There are two different approaches to building a neural cloning system: speaker adaptation and speaker encoding.

Speaker adaptation involves training the model on many people speaking in many different voices. The team did this using the LibriSpeech database, which contains 2,484 different speakers. The system learns to extract features from people's speech in order to mimic the subtle details of their pronunciation and rhythm.

Speaker encoding involves training a model to infer a speaker embedding from a new speaker's audio, then feeding that embedding into a separate multi-speaker generative system that has already been trained on many voices.

After training on LibriSpeech, the system was given ten audio samples per speaker drawn from another database, VCTK, which contains clips of 109 native English speakers with various accents. In other words, having been trained on the LibriSpeech dataset, the system then had to clone new voices from the VCTK dataset.

Compared with speaker adaptation, speaker encoding is easier to implement in real-life applications such as digital assistants, said Sercan Arik, a co-author of the paper and a research scientist at Baidu Research.

“Speaker adaptation requires the user to read specific utterances from a given text, whereas speaker encoding works with arbitrary utterances. This means speaker adaptation is unlikely to appear on consumer devices in the short term, as it is more challenging to scale to a wide user base. In contrast, speaker encoding is easier to deploy, as it is fast and has low memory requirements — it can even be deployed on smartphones.”
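The contrast between the two approaches can be sketched in code. The encoder, embedding dimensions, and synthesizer below are all hypothetical stand-ins (the paper's actual models are neural networks trained end to end); the sketch only illustrates the deployment difference — speaker encoding computes one fixed-length vector from a few clips and conditions a frozen synthesizer on it, with no per-user retraining.

```python
import numpy as np

def speaker_embedding(clips):
    """Hypothetical speaker encoder: averages per-frame features from a
    handful of clips into one fixed-length embedding. A real system
    would use a trained neural encoder; the mean is only a stand-in."""
    return np.mean([clip.mean(axis=0) for clip in clips], axis=0)

class ClonedVoice:
    """Toy multi-speaker synthesizer conditioned on a speaker embedding.
    Its weights stay frozen; only the embedding changes per speaker."""
    def __init__(self, embedding):
        self.embedding = embedding

    def synthesize(self, text):
        # A real model would generate a waveform; here we just report
        # what the output would be conditioned on.
        return f"<audio for {text!r}, {len(self.embedding)}-d speaker embedding>"

# Ten short clips from a previously unseen speaker, each an array of
# shape (frames, features) — mirroring the ten VCTK samples per speaker.
rng = np.random.default_rng(0)
clips = [rng.standard_normal((50, 16)) for _ in range(10)]

voice = ClonedVoice(speaker_embedding(clips))
print(voice.synthesize("Hello"))
```

Because the per-speaker state is just one small vector rather than a fine-tuned copy of the model, this is the property Arik points to: speaker encoding is cheap enough in time and memory to run on a phone, while speaker adaptation would require retraining model weights for every new user.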

The industry is deeply concerned that AI technology could be manipulated to spread false information.

Baidu's latest research shows that while it is possible to produce fake speech, current performance is not good enough to fool humans.

More diverse datasets are one way to improve the end result, and the voice cloning deep learning models themselves still have some room for improvement.

But it's not all bad news. Voice cloning technology can actually do a lot of good.

A mother can configure an audiobook reader with her own voice to read bedtime stories to her child when she cannot read to the child in person.

However, as this technology continues to improve and become more prevalent, we do need to take precautions to ensure it is not exploited or put to unintended uses.

Translated from Katyanna Quach's article for The Register: https://www.theregister.co.uk/2018/02/22/ai_human_voice_cloning/