Online Tutorial | Hands-On Evaluation of 3 Voice Cloning Models: GPT-SoVITS Accurately Captures the Voice of "Shiji Niangniang"

The box office of the Spring Festival film "Nezha 2" has been soaring: it has now passed 12 billion yuan, making it the first Chinese film to break the 10-billion-yuan mark, and it has entered the all-time global box-office top 10. In the film, the voice actors brought the characters vividly to life, from Nezha's raspy "smoky voice" to Taiyi Zhenren's Sichuan dialect to Shiji Niangniang's playful charm, sparking wide public discussion and pushing the behind-the-scenes art of dubbing into the spotlight.
Speaking of the charm of dubbing, the Bai Jingjing skin for Mi Yue in "Honor of Kings" is a perfect example. The developers invited Wang Huijun, the original voice actor of Bai Jingjing in the film "A Chinese Odyssey", to voice the character once again. When the familiar line "You and I must believe that letting go is also a kind of fate" rang out, it instantly reawakened many players' youthful nostalgia, and they "opened their wallets" for the sentiment.
Nowadays, voice cloning technology is advancing rapidly. With the help of advanced voice cloning models, ordinary users can cross time and space, reproduce the unique voice of a favorite character with one click, and easily satisfy their "dubbing addiction"! Three mainstream open-source models stand out: GPT-SoVITS, Fish Speech v1.4, and F5-E2 TTS. Each has its own strengths and plays a key role in different scenarios, whether film and television creation, audio content production, or casual fun dubbing.
The "Tutorial" section of HyperAI's official website is now online:
* GPT-SoVITS audio synthesis online demo:
https://hyper.ai/cn/tutorials/29812
* Fish Speech v1.4 Voice Cloning-Text to Speech Tool Demo:
https://hyper.ai/cn/tutorials/34680
* F5-E2 TTS clones any sound in just 3 seconds:
https://hyper.ai/cn/tutorials/35468
Today, I will give you a detailed introduction to these three open-source voice cloning models, and use the same reference audio and prompt text to evaluate their actual results for you!
GPT-SoVITS Audio Synthesis
* Release time: 2022
* Developer: Bilibili creator Huaer Buku
* One-click deployment:
https://hyper.ai/cn/tutorials/29812
The model combines SoVITS with Transformer-based speech encoding and caused a sensation in the AI speech-synthesis community as soon as it launched. Its high-fidelity synthesis is distinctive: with as little as a 5-second voice sample, it can perform zero-shot text-to-speech (TTS) conversion.
Taking Shiji Niangniang's voice in the film Nezha as an example: with GPT-SoVITS, we only need to collect a short clip of her classic lines from the film as a sample to accurately reproduce her lovely, lively, and forceful voice.
Fish Speech v1.4 Voice Cloning
* Release time: 2024
* Developer: Fish Audio Team
* One-click deployment:
https://hyper.ai/cn/tutorials/34680
The model was trained on roughly 150,000 hours of data and is proficient in Chinese, Japanese, and English. Its language processing approaches human level, and its vocal expression is rich and varied. Users can freely adjust timbre, pitch, and speaking rate to create their own voices, meeting personalized needs for character voices across different creative scenarios.
F5-E2 TTS clones any sound in just 3 seconds
* Release time: 2024
* Developers: Shanghai Jiao Tong University, University of Cambridge, and Geely Automobile Research Institute (Ningbo) Co., Ltd.
* One-click deployment:
https://hyper.ai/cn/tutorials/35468
F5 TTS uses a non-autoregressive generation method based on flow matching, combined with a Diffusion Transformer (DiT), to quickly generate natural, fluent speech faithful to the input text via zero-shot learning, without additional supervision. The core of E2 TTS is its fully non-autoregressive design: it generates the entire speech sequence at once rather than step by step, which significantly speeds up generation while maintaining high-quality output, achieving multi-voice hybrid cloning from as little as 3 seconds of audio.
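The flow-matching idea above can be illustrated with a toy sketch. Note this is a deliberate simplification, not the real F5-TTS: a trained DiT would predict the velocity field from text and reference audio, whereas here we use the closed-form velocity for a known target so the parallel ODE integration is runnable end to end.

```python
import numpy as np

def velocity(x_t, t, x1):
    # Along the linear path x_t = (1 - t) * x0 + t * x1, the conditional
    # velocity is (x1 - x0), which can be rewritten as (x1 - x_t) / (1 - t).
    # In F5-TTS a trained DiT predicts this field; here we use the closed form.
    return (x1 - x_t) / (1.0 - t)

def euler_sample(x0, x1, steps=100):
    # Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data).
    # Each Euler step updates ALL frames of the sequence in parallel,
    # which is what makes non-autoregressive generation fast.
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt, x1)
    return x

rng = np.random.default_rng(0)
target = rng.normal(size=(80, 200))   # stand-in for an 80-band mel spectrogram
noise = rng.normal(size=target.shape) # starting point: pure Gaussian noise
generated = euler_sample(noise, target)
print(np.abs(generated - target).max())  # transport error is tiny
```

Because the whole spectrogram is refined jointly over a fixed number of ODE steps, generation cost does not grow token by token as it would in an autoregressive model.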
This model supports 3 functions:
* Single-speaker voice generation (Batched TTS): generates speech for input text in the voice of the uploaded reference audio.
* Podcast generation: simulates a two-person conversation based on audio from two speakers.
* Multiple speech-type generation: given recordings of the same speaker in different emotions, generates audio in each of those emotional styles.
The above is the review of the voice cloning models we prepared for you. If you are interested, come and experience them for yourself!