Online Tutorial: How Does Shiji Niangniang Instantly Become a "Sichuan and Chongqing Girl"? Step-Audio-TTS Realizes Voice cloning/music synthesis/speech Synthesis in One

a year ago

The global enthusiasm caused by DeepSeek open source is still there. Recently, Step Star and Geely Auto Group have once again made a move and open-sourced the Step-Audio-TTS-3B model, which has once again triggered widespread discussion in the industry.

once Upon a time,The diversity and complexity of dialect data and the high demand for model generalization make the voice cloning model perform poorly on dialects.Step-Audio-TTS-3B can vividly interpret the characteristics of local languages. It is trained based on the large-scale synthetic data set of the LLM-Chat paradigm, deeply understands the structure of the language, and grasps the subtle changes of the language from the lines. Whether it is the passionate Sichuan dialect or the nine-tone and six-tone Cantonese, it can accurately grasp its rhythm and tone, showing the strong local customs.

Not only that, it is also the first TTS model to realize RAP and humming generation, filling the gap in music speech synthesis. In the past, creating a rhythmic RAP content required professional singers. But now, with the help of Step-Audio-TTS-3B, users can quickly generate a RAP vocal with precise rhythm and smooth flow, inspiring unlimited possibilities.

Currently, the "Step-Audio-TTS-3B production-level dialect speech generation model" has been launched in the "Tutorial" section of HyperAI's official website.This tutorial includes three functions: speech synthesis, music synthesis, and voice cloning. Come and experience it yourself~

Tutorial address:

https://go.hyper.ai/QrTmW

Demo Run

1. Log in to hyper.ai, on the Tutorial page, select Step-Audio-TTS-3B Production-Level Dialect Speech Generation Model, and click Run this tutorial online.

2. After the page jumps, click "Clone" in the upper right corner to clone the tutorial into your own container.

3. Select "NVIDIA RTX A6000" and "PyTorch" images. The OpenBayes platform has launched a new billing method. You can choose "pay as you go" or "daily/weekly/monthly" according to your needs. Click "Continue". New users can register using the invitation link below to get 4 hours of RTX 4090 + 5 hours of CPU free time!

HyperAI exclusive invitation link (copy and open in browser):

https://openbayes.com/console/signup?r=Ada0322_QZy7

4. Wait for resources to be allocated. The first clone will take about 2 minutes. When the status changes to "Running", click the jump arrow next to "API Address" to jump to the Demo page. Please note that users must complete real-name authentication before using the API address access function.

Effect display

This tutorial includes three functions: general speech synthesis, music synthesis and speech cloning.

1. General speech synthesis

This feature presets the official default voice character Tingting and the newly added voice Nezha, and supports multi-language generation, emotions, dialects and other settings.

Speech synthesis tone description

* The sound Tingting is generated by the official 4s audio prompt file

* The sound of Nezha is generated from the 14s audio prompt "I am Nezha the Third Prince, I am uninhibited and love to write poetry, I walk with my hands in my pockets, and I can make a curved road straight" file

On the Demo page, select "Normal Speech Synthesis", enter text, select the speaker (default is Tingting), select the emotion (happy, angry, sad and coquettish), select the language/dialect (Chinese, English, Japanese, Mandarin, Sichuanese, Cantonese and Guangdong dialect), select the speaking speed (fast or slow). Click "Generate Speech".

2. Music Synthesis

This function presets the official website's default voice character Tingting and the newly added Nezha timbre, and supports RAP and humming.

RAP Sound Description

* The sound Tingting is generated by the official 11s audio prompt file

* The sound of Nezha is generated by the 14s audio prompt "I am so scared of the thunder rolling in the sky, it strikes me all over, I blow the trumpet to change my fate, I laugh to get through the calamity, tick-tick-tick-tick-tick" file

Humming Tone Description

* The sound Tingting is generated by a 12s audio prompt file

* The sound of Nezha is generated by the 14s audio prompt "I am born fearless, no matter who is my father or whoever, if the master takes out the ruler, he will never be able to command me" file

Select "Music Synthesis" on the Demo page, enter the text, select the speaker (default is Tingting), select the mode (RAP or Humming), and click "Generate RAP / Humming".

3. Voice cloning

This function supports users to upload custom-timbre audio and generate personalized voice.

Select "Voice Cloning" on the Demo page, enter text, upload reference audio (.wav format), name the cloned voice, select emotion (happy, angry, sad and coquettish), select language/dialect (Chinese, English, Japanese, Mandarin, Sichuanese, Cantonese and Guangdong dialect), select speech speed (fast or slow). Click "Generate Clone Voice".

Online Tutorial: How Does Shiji Niangniang Instantly Become a "Sichuan and Chongqing Girl"? Step-Audio-TTS Realizes Voice cloning/music synthesis/speech Synthesis in One

a year ago

Information

Artificial Intelligence

Deep Learning

Tutorial address:

https://go.hyper.ai/QrTmW

Demo Run

1. Log in to hyper.ai, on the Tutorial page, select Step-Audio-TTS-3B Production-Level Dialect Speech Generation Model, and click Run this tutorial online.

2. After the page jumps, click "Clone" in the upper right corner to clone the tutorial into your own container.

HyperAI exclusive invitation link (copy and open in browser):

https://openbayes.com/console/signup?r=Ada0322_QZy7

Effect display

This tutorial includes three functions: general speech synthesis, music synthesis and speech cloning.

1. General speech synthesis

This feature presets the official default voice character Tingting and the newly added voice Nezha, and supports multi-language generation, emotions, dialects and other settings.

Speech synthesis tone description

* The sound Tingting is generated by the official 4s audio prompt file

2. Music Synthesis

This function presets the official website's default voice character Tingting and the newly added Nezha timbre, and supports RAP and humming.

RAP Sound Description

* The sound Tingting is generated by the official 11s audio prompt file

Humming Tone Description

* The sound Tingting is generated by a 12s audio prompt file

* The sound of Nezha is generated by the 14s audio prompt "I am born fearless, no matter who is my father or whoever, if the master takes out the ruler, he will never be able to command me" file

Select "Music Synthesis" on the Demo page, enter the text, select the speaker (default is Tingting), select the mode (RAP or Humming), and click "Generate RAP / Humming".

3. Voice cloning

This function supports users to upload custom-timbre audio and generate personalized voice.

Command Palette

Online Tutorial: How Does Shiji Niangniang Instantly Become a "Sichuan and Chongqing Girl"? Step-Audio-TTS Realizes Voice cloning/music synthesis/speech Synthesis in One

Demo Run

Effect display

Command Palette

Online Tutorial: How Does Shiji Niangniang Instantly Become a "Sichuan and Chongqing Girl"? Step-Audio-TTS Realizes Voice cloning/music synthesis/speech Synthesis in One

Demo Run

Effect display

Related News

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.

Achieve "voice-over Freedom" With Just 3 Seconds of Audio: Mistral open-source Speech Model Voxtral-4B-TTS-2603; Set a New Benchmark for Data Quality: Sutra 10B Pretraining.

Zero-sampling TTS Breakthrough! A Few Seconds of Reference Audio, OmniVoice Helps You Easily Clone Hundreds of Languages; 17 Languages All in One Go: MDPbench Solves the Major Problem of Parsing low-resource Text systems.

Online Tutorial | Supports 600+ Languages, Xiaomi Open Sources OmniVoice: Achieve Voice Cloning With Just 3-10 Seconds of Reference Audio

Fast and Accurate! Cohere Releases open-source Transcription Model; Accurate Parsing of Complex Scenarios: Chandra-ocr-2 Visual Language Model Achieves Precise OCR.

Free CPU Tutorial | Achieving 8.8k Stars, the Supertonic-3 TTS Model Has Only About 99M Parameters and Supports 31 languages.

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

A Locally Runnable Privacy Detection Model: Privacy Filter Achieves high-quality PII Filtering at Low Cost; Hardcore Open Source! Covering the Transfermarkt Structured Football Dataset With Over 80,000 matches.

Online Tutorial | HKU Team Open Sources DeepTutor, a Personal Learning Assistant That Enables Interactive Learning Covering Understanding, Reasoning, and Generation Through Multi-Agent Collaboration

Command Palette

Online Tutorial: How Does Shiji Niangniang Instantly Become a "Sichuan and Chongqing Girl"? Step-Audio-TTS Realizes Voice cloning/music synthesis/speech Synthesis in One

Demo Run

Effect display

Related News

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.

Achieve "voice-over Freedom" With Just 3 Seconds of Audio: Mistral open-source Speech Model Voxtral-4B-TTS-2603; Set a New Benchmark for Data Quality: Sutra 10B Pretraining.

Zero-sampling TTS Breakthrough! A Few Seconds of Reference Audio, OmniVoice Helps You Easily Clone Hundreds of Languages; 17 Languages All in One Go: MDPbench Solves the Major Problem of Parsing low-resource Text systems.

Online Tutorial | Supports 600+ Languages, Xiaomi Open Sources OmniVoice: Achieve Voice Cloning With Just 3-10 Seconds of Reference Audio

Fast and Accurate! Cohere Releases open-source Transcription Model; Accurate Parsing of Complex Scenarios: Chandra-ocr-2 Visual Language Model Achieves Precise OCR.

Free CPU Tutorial | Achieving 8.8k Stars, the Supertonic-3 TTS Model Has Only About 99M Parameters and Supports 31 languages.

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

A Locally Runnable Privacy Detection Model: Privacy Filter Achieves high-quality PII Filtering at Low Cost; Hardcore Open Source! Covering the Transfermarkt Structured Football Dataset With Over 80,000 matches.

Online Tutorial | HKU Team Open Sources DeepTutor, a Personal Learning Assistant That Enables Interactive Learning Covering Understanding, Reasoning, and Generation Through Multi-Agent Collaboration

Related News

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.

Achieve "voice-over Freedom" With Just 3 Seconds of Audio: Mistral open-source Speech Model Voxtral-4B-TTS-2603; Set a New Benchmark for Data Quality: Sutra 10B Pretraining.

Zero-sampling TTS Breakthrough! A Few Seconds of Reference Audio, OmniVoice Helps You Easily Clone Hundreds of Languages; 17 Languages All in One Go: MDPbench Solves the Major Problem of Parsing low-resource Text systems.

Online Tutorial | Supports 600+ Languages, Xiaomi Open Sources OmniVoice: Achieve Voice Cloning With Just 3-10 Seconds of Reference Audio

Fast and Accurate! Cohere Releases open-source Transcription Model; Accurate Parsing of Complex Scenarios: Chandra-ocr-2 Visual Language Model Achieves Precise OCR.

Free CPU Tutorial | Achieving 8.8k Stars, the Supertonic-3 TTS Model Has Only About 99M Parameters and Supports 31 languages.

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

A Locally Runnable Privacy Detection Model: Privacy Filter Achieves high-quality PII Filtering at Low Cost; Hardcore Open Source! Covering the Transfermarkt Structured Football Dataset With Over 80,000 matches.

Online Tutorial | HKU Team Open Sources DeepTutor, a Personal Learning Assistant That Enables Interactive Learning Covering Understanding, Reasoning, and Generation Through Multi-Agent Collaboration

Related News

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.

Achieve "voice-over Freedom" With Just 3 Seconds of Audio: Mistral open-source Speech Model Voxtral-4B-TTS-2603; Set a New Benchmark for Data Quality: Sutra 10B Pretraining.

Zero-sampling TTS Breakthrough! A Few Seconds of Reference Audio, OmniVoice Helps You Easily Clone Hundreds of Languages; 17 Languages All in One Go: MDPbench Solves the Major Problem of Parsing low-resource Text systems.

Online Tutorial | Supports 600+ Languages, Xiaomi Open Sources OmniVoice: Achieve Voice Cloning With Just 3-10 Seconds of Reference Audio

Fast and Accurate! Cohere Releases open-source Transcription Model; Accurate Parsing of Complex Scenarios: Chandra-ocr-2 Visual Language Model Achieves Precise OCR.

Free CPU Tutorial | Achieving 8.8k Stars, the Supertonic-3 TTS Model Has Only About 99M Parameters and Supports 31 languages.

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

A Locally Runnable Privacy Detection Model: Privacy Filter Achieves high-quality PII Filtering at Low Cost; Hardcore Open Source! Covering the Transfermarkt Structured Football Dataset With Over 80,000 matches.

Online Tutorial | HKU Team Open Sources DeepTutor, a Personal Learning Assistant That Enables Interactive Learning Covering Understanding, Reasoning, and Generation Through Multi-Agent Collaboration