F5-E2 TTS Clones Any Sound in Just 3 Seconds
F5-TTS: Voice cloning, two-person dialogue, multi-tone mixing

Tutorial Introduction
This tutorial requires only a single RTX 4090 to launch.
This tutorial includes two demos of the models, F5-TTS and E2 TTS.
F5-TTS is a high-performance text-to-speech (TTS) system jointly open-sourced in 2024 by Shanghai Jiao Tong University, the University of Cambridge, and Geely Automobile Research Institute (Ningbo) Co., Ltd. It uses a non-autoregressive generation method based on flow matching combined with a diffusion transformer (DiT), and is described in the paper "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching". Through zero-shot learning, the system can quickly generate natural, fluent speech that is faithful to the original text, without additional supervision. F5-TTS supports multilingual synthesis, including Chinese and English, and handles long texts effectively. It also provides emotion control, adjusting the emotional expression of the synthesized speech according to the text content, and speed control, letting users adjust playback speed as needed. Trained on a large-scale dataset of 100,000 hours, the system shows excellent performance and generalization.

The main features of F5-TTS are zero-shot voice cloning, speed control, emotional expression control, long-text synthesis, and multilingual support. Its key techniques include flow matching, the diffusion transformer (DiT), improved text representation via ConvNeXt V2, the Sway Sampling strategy, and an end-to-end system design. F5-TTS has a wide range of application scenarios, including audiobooks, voice assistants, language learning, news broadcasting, and game dubbing, providing powerful speech synthesis capabilities for both commercial and non-commercial use.
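To make the flow-matching idea above concrete, here is a minimal, framework-agnostic sketch of one conditional flow-matching training step using NumPy. This is an illustration of the general technique, not the actual F5-TTS code: the linear interpolation path and the constant-velocity regression target are standard in flow-matching formulations, and the `model` callable stands in for the DiT backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_step(x1, model, rng):
    """One conditional flow-matching training step (illustrative sketch).

    x1: a batch of target mel-spectrogram frames, shape (batch, dim).
    model: any callable (x_t, t) -> predicted velocity, same shape as x_t.
    Returns the scalar regression loss.
    """
    x0 = rng.standard_normal(x1.shape)        # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))    # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1             # linear interpolation path
    v_target = x1 - x0                        # constant target velocity along the path
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2)  # MSE regression loss

# Exercise the step with a trivial zero predictor standing in for the network.
x1 = rng.standard_normal((4, 8))
loss = flow_matching_step(x1, lambda x_t, t: np.zeros_like(x_t), rng)
```

At inference time, the trained velocity field is integrated from noise to speech with an ODE solver; the Sway Sampling strategy mentioned above reshapes how the integration timesteps are distributed.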
E2 TTS, short for "Embarrassingly Easy Text-to-Speech", is an advanced text-to-speech (TTS) system that achieves human-level naturalness and speaker similarity through a simplified pipeline. The core of E2 TTS is its fully non-autoregressive design: it generates the entire speech sequence at once rather than step by step, significantly increasing generation speed while maintaining high-quality output. The paper, "E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS", has been accepted at SLT 2024. In the E2 TTS framework, the text input is converted into a character sequence padded with filler tokens, and a flow-matching-based mel-spectrogram generator is trained on an audio infilling task. Unlike many previous works, it requires no additional components (e.g., duration models or grapheme-to-phoneme converters) and no complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS performance, comparable to or surpassing previous works including Voicebox and NaturalSpeech 3. Its simplicity also allows flexibility in the input representation.
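The input scheme described above, a character sequence padded with filler tokens up to the mel-spectrogram length, can be sketched in a few lines. The filler-token name and function signature below are illustrative assumptions, not taken from the paper's code:

```python
FILLER = "<F>"  # filler-token name is an illustrative assumption

def pad_with_fillers(text, mel_len):
    """Convert text to a character sequence padded with filler tokens to the
    target mel-spectrogram length (sketch of the E2 TTS input scheme)."""
    chars = list(text)
    if len(chars) > mel_len:
        raise ValueError("mel sequence shorter than character sequence")
    return chars + [FILLER] * (mel_len - len(chars))

seq = pad_with_fillers("hi", 5)
```

Because the padded character sequence and the mel sequence share one length, the model can learn text-to-audio alignment implicitly, which is what removes the need for a separate duration model.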
This tutorial supports the following models and features:
2 model checkpoints:
- F5-TTS
- E2 TTS
3 features:
- Batched TTS (single-speaker generation): generates speech from text based on an uploaded audio clip.
- Podcast Generation (two-speaker generation): simulates a two-person dialogue based on audio from two speakers.
- Multiple Speech-Type Generation: given audio clips of the same speaker in different emotional states, generates speech in the corresponding emotions.
Run Steps
After starting the container, click the API address to enter the web interface.

1. Batched TTS
Select the TTS function, upload the audio and text prompt as required, and set advanced parameters as needed.
- Audio: upload a clear, high-quality clip of a single speaker; the model imitates this voice when generating.
- Text Prompt: the text to synthesize.
Advanced Parameters
- Reference Text: leave blank to transcribe the reference audio automatically; any text entered here overrides the automatic transcription.
- Remove Silences: the model tends to produce silences, especially in longer audio. Enable this option to remove them automatically. Note that this is an experimental feature and may produce strange results; it also increases generation time.
- Custom Split Words: custom words to split on, separated by commas; leave blank to use the default list.
- Speed: controls the speed of the generated speech.
As shown in the figure below.
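Behind the "Batched" part of Batched TTS, long input text has to be split into chunks before synthesis, breaking at punctuation or at the custom split words above. The sketch below illustrates one plausible way to do this; the function name, the `max_chars` default, and the exact splitting rules are assumptions, not the app's actual implementation.

```python
import re

def chunk_text(text, max_chars=135, split_words=(".", "!", "?", ",")):
    """Split long input text into batches no longer than max_chars,
    breaking at the given split words/punctuation where possible.
    (Illustrative sketch; the app's real chunking logic may differ.)"""
    pattern = "|".join(re.escape(w) for w in split_words)
    pieces = re.split(f"({pattern})", text)
    # Re-attach each delimiter to the sentence that precedes it.
    sentences = ["".join(p) for p in zip(pieces[0::2], pieces[1::2] + [""])]
    chunks, cur = [], ""
    for s in sentences:
        if not s.strip():
            continue
        if cur and len(cur) + len(s) > max_chars:
            chunks.append(cur)
            cur = s
        else:
            cur += s
    if cur:
        chunks.append(cur)
    return chunks

chunks = chunk_text("Hello world. This is a test. Bye.", max_chars=20)
```

Each chunk is then synthesized separately and the resulting audio segments are concatenated, which is why odd splits (and the Custom Split Words setting) can affect prosody at chunk boundaries.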


2. Podcast Generation
Select the Podcast Generation function and upload the two speakers' audio and text prompts as required. This function imitates a conversation between two people and requires a name and an audio clip for each speaker.
- Audio: upload a clear, high-quality speech clip for each of the two speakers; the model imitates these voices when generating.
- Reference Text: leave blank to transcribe the reference audio automatically; any text entered here overrides the automatic transcription.
- Select Model: defaults to F5-TTS.
As shown in the figure below.
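Conceptually, the podcast script has to be broken into per-speaker turns before each turn is synthesized in the matching voice. The sketch below shows one way to parse a `Name: line` script into (speaker, text) pairs; the script format and function name are illustrative assumptions, not the app's actual parser.

```python
import re

def parse_script(script, speaker1, speaker2):
    """Split a two-person script of the form 'Name: line' into
    (speaker, text) turns. (Illustrative sketch only.)"""
    pattern = re.compile(
        rf"^({re.escape(speaker1)}|{re.escape(speaker2)}):\s*(.+)$"
    )
    turns = []
    for line in script.splitlines():
        m = pattern.match(line.strip())
        if m:
            turns.append((m.group(1), m.group(2)))
    return turns

script = """Alice: Welcome to the show!
Bob: Thanks, happy to be here.
Alice: Let's get started."""
turns = parse_script(script, "Alice", "Bob")
```

Each turn is then generated with the reference audio of the named speaker and the clips are stitched together in order, producing the simulated dialogue.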


3. Multiple Speech-Type Generation
Select the Multiple Speech-Type Generation function and upload audio clips and text prompts for different emotions as required. This function simulates emotions, generating each segment of audio in the emotion indicated in the text.
- Audio: upload multiple clear, high-quality clips, one per emotion; the model imitates each clip when generating.
- Reference Text: leave blank to transcribe the reference audio automatically; any text entered here overrides the automatic transcription.
- Select Model: defaults to F5-TTS.
For example, upload six audio clips (Regular, Surprised, Sad, Angry, Whisper, Shouting) and generate the following text:
(Regular) Hello, I'd like to order a sandwich please. (Surprised) What do you mean you're out of bread? (Sad) I really wanted a sandwich though… (Angry) You know what, darn you and your little shop, you suck! (Whisper) I'll just go back home and cry now. (Shouting) Why me?!
You can then generate a single speech output that moves through the different emotions, as follows.
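The annotated text above has to be split into (speech type, segment) pairs so that each segment can be synthesized with the matching reference clip. The sketch below parses the `(SpeechType)` marker format from the example; it is an illustration of the idea, not the app's actual parser.

```python
import re

def parse_speech_types(text):
    """Split text annotated with (SpeechType) markers into
    (speech_type, segment) pairs. (Illustrative sketch only.)"""
    parts = re.split(r"\(([^)]+)\)", text)
    # parts alternates: [before-first-tag, tag1, text1, tag2, text2, ...]
    return [(tag, seg.strip()) for tag, seg in zip(parts[1::2], parts[2::2])]

text = "(Regular) Hello there. (Surprised) What?! (Sad) Oh no..."
segments = parse_speech_types(text)
```

Each pair is then generated against the reference clip uploaded for that speech type, and the segments are concatenated into one continuous output.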


Exchange and Discussion
🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial exchange group; scan the QR code and add the note [SD Tutorial] to join the group, discuss technical issues, and share your results ↓
