Dia-1.6B: Emotional Speech Synthesis Demo
1. Tutorial Introduction
This tutorial is configured for a single RTX 4090 GPU and currently supports English generation only.
👉 This project provides the following model:
- Dia-1.6B: a 1.6B-parameter text-to-speech model.
2. Project Examples

3. Operation Steps
1. After starting the container, click the API address to open the web interface.
If "Bad Gateway" is displayed, the model is still initializing. Because the model is large, please wait about 1-2 minutes and then refresh the page.

2. Once the page loads, you can start interacting with the model.
Parameter Description:
- Max New Tokens: Controls the length of the generated audio.
- CFG Scale: Controls how closely the generation follows the input conditions.
- Temperature: Controls the randomness of the output.
- Top P: Controls the diversity of candidate token selection.
- CFG Filter Top K: Applies Top-K filtering together with CFG, balancing relevance and diversity.
- Speed Factor: Adjusts the playback speed of the generated audio.
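To build intuition for how Temperature, Top P, and Top K interact, the sketch below re-implements the standard filtering steps in plain Python. This is a hypothetical stand-alone illustration, not the demo's actual decoding code; the function name `filter_logits` and the example logit values are made up for the example.

```python
import math

def filter_logits(logits, temperature=1.0, top_p=0.95, top_k=3):
    """Illustrative sampling filter: temperature, then Top-K, then Top-P."""
    # Temperature rescales logits: values < 1 sharpen the distribution,
    # values > 1 flatten it (more randomness).
    scaled = [l / temperature for l in logits]
    # Softmax to probabilities (subtract max for numerical stability).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top K: keep only the k most likely candidates.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top P (nucleus): within those, keep the smallest prefix whose
    # cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalise over the surviving candidates.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# With top_p=0.8, only the two most likely tokens survive the nucleus cut.
dist = filter_logits([2.0, 1.0, 0.2, -1.0], temperature=1.0, top_p=0.8, top_k=3)
```

Raising Temperature or Top P widens the surviving candidate set (more varied audio); lowering them makes generation more deterministic.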
How to Use
Enter the text to generate in "Input text". Use the [S1] and [S2] tags to distinguish speakers in the dialogue. If the reference audio contains two different voices, mark them as [S1] (first voice) and [S2] (second voice); the speakers in your text will then be matched to those voices one-to-one. If the reference audio contains only one voice, use [S1] alone.
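The tagging rule above can be sketched as a small helper. `tag_dialogue` is a hypothetical function invented for illustration: it assigns [S1]/[S2] tags to speakers in order of first appearance, mirroring how voices in the reference audio are matched.

```python
def tag_dialogue(turns):
    """Build tagged input text from (speaker, line) pairs.

    Illustrative helper only: each distinct speaker gets the next [Sn]
    tag in order of first appearance.
    """
    tags = {}
    parts = []
    for speaker, line in turns:
        if speaker not in tags:
            tags[speaker] = f"[S{len(tags) + 1}]"
        parts.append(f"{tags[speaker]} {line.strip()}")
    return " ".join(parts)

text = tag_dialogue([
    ("Alice", "Dia is a text-to-speech model."),
    ("Bob", "Try it and let us know what you think!"),
])
# text == "[S1] Dia is a text-to-speech model. [S2] Try it and let us know what you think!"
```

The resulting string can be pasted directly into the "Input text" box.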

Exchange and Discussion
🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial discussion group; scan the QR code below and add the note [SD Tutorial] to join, discuss technical issues, and share your results↓
