Dia-1.6B: Emotional Speech Synthesis Demo
1. Tutorial Introduction
Dia-1.6B is a text-to-speech model released by the Nari Labs team on April 21, 2025. It generates highly realistic conversations directly from text scripts and supports audio-conditioned control of emotion and tone. Dia-1.6B can also produce non-verbal sounds such as laughter, coughing, and throat clearing, which makes the dialogue more natural and vivid. The model supports multi-speaker dialogue generation: speakers are distinguished with tags such as [S1] and [S2], entire multi-speaker conversations are generated in a single pass, and natural rhythm and emotional transitions are preserved. The project also supports uploading your own audio samples; the model will imitate the voice in the sample, enabling zero-shot voice cloning.
This tutorial runs on a single RTX 4090 and currently supports English generation only.
👉 This project provides the following model:
- Dia-1.6B: a text-to-speech model with 1.6B parameters.
2. Project Examples

3. Operation Steps
1. After starting the container, click the API address to enter the Web interface
If "Bad Gateway" is displayed, the model is still initializing. Since the model is large, wait about 1-2 minutes and then refresh the page.
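If you launch the container from a script, the wait-and-refresh step above can be automated by polling the API address until it responds. A minimal sketch using only the standard library (the status set is an assumption; the tutorial only mentions "Bad Gateway", i.e. HTTP 502):

```python
import time
import urllib.request
import urllib.error

# Statuses treated as "still initializing" rather than hard failures
# (hypothetical choice; the tutorial only mentions 502 Bad Gateway)
RETRYABLE_STATUSES = {502, 503, 504}

def should_retry(status):
    """True if the HTTP status means the backend is still loading."""
    return status in RETRYABLE_STATUSES

def wait_until_ready(url, timeout_s=120, interval_s=5):
    """Poll the Web UI until it answers with HTTP 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status == 200:
                    return True
        except urllib.error.HTTPError as e:
            if not should_retry(e.code):
                raise  # non-retryable error, e.g. 404
        except (urllib.error.URLError, OSError):
            pass  # connection refused: container still starting
        time.sleep(interval_s)
    return False
```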

2. Once the page loads, you can start generating audio
Parameter Description:
- Max New Tokens: controls the length of the generated audio.
- CFG Scale: controls how closely generation follows the input text (higher values adhere more strictly).
- Temperature: controls the randomness of the output.
- Top P: controls the diversity of candidate token selection (nucleus sampling).
- CFG Filter Top K: Top-K filtering applied together with CFG, balancing relevance and diversity.
- Speed Factor: adjusts the playback speed of the generated audio.
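To build intuition for how these parameters interact, here is an illustrative sketch of temperature, top-k, and top-p (nucleus) sampling, plus the classifier-free guidance formula behind CFG Scale. This is a generic sampling sketch with Dia's default-looking values as assumptions, not the project's actual implementation:

```python
import math
import random

def apply_cfg(cond_logits, uncond_logits, cfg_scale=3.0):
    """Classifier-free guidance: push logits toward the text-conditioned
    prediction; a higher CFG Scale means stricter adherence to the input."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def sample_token(logits, temperature=1.3, top_p=0.95, top_k=35, rng=None):
    """Illustrative temperature + top-k + top-p sampling of one token index."""
    rng = rng or random.Random(0)
    # 1) Temperature: divide logits before softmax; higher = more random.
    scaled = [l / temperature for l in logits]
    # 2) Softmax to probabilities (numerically stable via max subtraction).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3) Top-K: keep only the k most likely tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # 4) Top-P: within those, keep the smallest set whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # 5) Renormalize over the kept set and sample.
    z = sum(probs[i] for i in kept)
    r = rng.random() * z
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With a low temperature and a small top_p, sampling collapses to the most likely token, which is why lowering Temperature makes the output more deterministic.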
How to use
Enter the text to generate in "Input Text". Use the [S1] and [S2] tags to distinguish speakers. If the reference audio contains two different voices, mark them [S1] (first voice) and [S2] (second voice); the generated speakers will map one-to-one to those two voices. If the reference audio contains only one voice, use [S1] alone.
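When scripting longer conversations, it can help to assemble the tagged input programmatically. A small helper (illustrative, not part of the project) that maps speaker names to [S1]/[S2] tags in order of first appearance:

```python
def build_script(turns):
    """Format (speaker, line) pairs into Dia's tagged input format.

    Speaker names are assigned [S1], [S2], ... in order of first appearance.
    """
    tags = {}
    parts = []
    for speaker, line in turns:
        if speaker not in tags:
            tags[speaker] = f"[S{len(tags) + 1}]"
        parts.append(f"{tags[speaker]} {line.strip()}")
    return " ".join(parts)
```

For example, `build_script([("Alice", "Hi there."), ("Bob", "Hello! (laughs)")])` produces `"[S1] Hi there. [S2] Hello! (laughs)"`, ready to paste into the "Input Text" box.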

Exchange and discussion
🖌️ If you come across a high-quality project, please leave a message to recommend it! We have also set up a tutorial exchange group; scan the QR code and note [SD Tutorial] to join, discuss technical issues, and share results↓
