Step-Audio-TTS-3B: A Production-Level Dialect Speech Generation Model


1. Tutorial Introduction
Step-Audio is the industry's first product-level open-source real-time voice dialogue system that integrates speech understanding and generation control. It was open-sourced by the Stepfun-AI team in 2025. It supports multilingual generation (e.g., Chinese, English, Japanese), emotional speech (e.g., happy, sad), dialects (e.g., Cantonese, Sichuanese), controllable speech rate and prosodic style, as well as RAP and humming.
This tutorial uses Step-Audio-TTS-3B as the demonstration model; the compute resource is a single RTX 4090 GPU.
Supported functions:
- General speech synthesis
Presets the official default voice Tingting, plus an added Nezha voice; supports multilingual generation, emotion, dialect, and other settings
- Music synthesis
Presets the official default voice Tingting, plus an added Nezha voice; supports RAP and humming
- Voice cloning
Users can upload custom audio, enter the transcript of that audio, and define a voice name as needed
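For readers who want to script the three modes rather than use the web page, the sketch below assembles a synthesis request as a plain dictionary. Note that the field names and allowed values (`text`, `voice`, `mode`, `emotion`, `dialect`) are illustrative assumptions for this tutorial, not the actual Step-Audio API; consult the Step-Audio repository for the real interface.

```python
# Minimal sketch of assembling a TTS request for the three supported modes.
# NOTE: all field names and values here are illustrative assumptions,
# not the real Step-Audio interface.

def build_tts_request(text, voice="Tingting", mode="general",
                      emotion=None, dialect=None):
    """Return a request dict for one of the three synthesis modes.

    mode: "general" (plain TTS), "music" (RAP/humming), or "clone".
    """
    if mode not in ("general", "music", "clone"):
        raise ValueError(f"unknown mode: {mode}")
    request = {"text": text, "voice": voice, "mode": mode}
    if emotion:
        request["emotion"] = emotion   # e.g. "happy", "sad"
    if dialect:
        request["dialect"] = dialect   # e.g. "Cantonese", "Sichuanese"
    return request

# Example: a happy Cantonese line in the default Tingting voice.
req = build_tts_request("你好，世界", emotion="happy", dialect="Cantonese")
```

Keeping the request a plain dict makes it easy to serialize to JSON if you later post it to a local inference endpoint.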
2. Operation steps
1. After starting the container, click the API address to open the web interface. (If "Bad Gateway" is displayed, the model is still initializing; wait about one minute and try again.)

2. Once the web page loads, you can perform several kinds of speech synthesis:
1. General speech synthesis

2. RAP/Humming mode

3. Voice cloning

Tip: To quickly produce a RAP or humming version of a cloned voice, add (RAP) or (humming) before the text to be generated.
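The tag-prefix tip above can be sketched as a tiny helper that prepends the style tag to the text. The tag strings `(RAP)` and `(humming)` come from the tip itself; the helper name and signature are our own illustration.

```python
# Prepend a style tag so the model renders the line as RAP or humming,
# following the tip above. The helper itself is plain string formatting.

def tag_text(text, style=None):
    """Prefix text with (RAP) or (humming); no prefix means normal speech."""
    if style is None:
        return text
    if style not in ("RAP", "humming"):
        raise ValueError(f"unsupported style: {style}")
    return f"({style}){text}"

print(tag_text("我有一所房子，面朝大海", style="RAP"))
# -> (RAP)我有一所房子，面朝大海
```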
Exchange and Discussion
🖌️ If you come across a high-quality project, please leave a message to recommend it! We have also set up a tutorial discussion group; you are welcome to scan the QR code and note [SD Tutorial] to join the group, discuss technical issues, and share application results ↓