Online Tutorial | Microsoft Open Sources VibeVoice, Enabling 90 Minutes of Natural Dialogue Between 4 Roles

7 months ago

In recent years, text-to-speech (TTS) synthesis technology has made significant progress, enabling the synthesis of high-fidelity, naturally-sounding short utterances for a single speaker. However, significant challenges remain in the scalable synthesis of long-format, multi-speaker dialogue audio, limiting its application in scenarios such as podcasts and multi-role audiobooks.

Traditional methods, even when generating such audio by concatenating independently synthesized utterances, still fall short in achieving natural dialogue turn-taking and content-aware generation. With the increasing demands of industry applications, research on multi-speaker long conversation speech generation has emerged in various sectors.However, most of the results have not yet been open-sourced, or there are still unresolved issues regarding the generation length and stability.

In this context,Microsoft has open-sourced VibeVoice, aiming to enable scalable long-format, multi-speaker speech synthesis. VibeVoice employs a next-token diffusion approach to synthesize long multi-speaker speech, a unified method that uses diffusion autoregression to generate latent vectors to model continuous data.

To this end, the research team pioneered a novel continuous speech segmenter that, compared with the currently popular Encoder model, achieves an 80-fold data compression improvement while maintaining comparable performance, resulting in a compression rate of up to 3200× (corresponding to a 7.5 Hz frame rate). This significantly improves the computational efficiency of long sequence processing while ensuring audio fidelity.

Despite its simple architecture, VibeVoice demonstrates exceptional capabilities.It can synthesize up to 90 minutes of speech with up to four speakers within a 64K context window, producing richer timbre, more natural intonation, and capturing the atmosphere of a real conversation.It demonstrates stronger transferability in cross-language applications, and its overall performance surpasses existing open-source and proprietary dialogue models.

As the year draws to a close, this article uses VibeVoice to generate a 1:20 long audio clip of New Year's greetings. The generated quality is significantly improved, moving away from the monotonous "mechanical sound" and presenting a full and layered timbre with emotional tension, making it sound warm and vivid.

"VibeVoice-Realtime TTS: Real-time Speech Synthesis Service" is now available on the tutorial section of the HyperAI website (hyper.ai). You can deploy and experience it with just one click!

Tutorial Link:

https://go.hyper.ai/jdZrA

Demo Run

1. After entering the hyper.ai homepage, select "VibeVoice-Realtime TTS: Real-time Speech Synthesis Service", or select it from the "Tutorials" page. Then click "Run this tutorial online".

2. After the page redirects, click "Clone" in the upper right corner to clone the tutorial into your own container.

Note: You can switch languages in the upper right corner of the page. Currently, Chinese and English are available. This tutorial will show the steps in English.

3. Select the "NVIDIA GeForce RTX 5090" and "PyTorch" images, and choose "Pay As You Go" or "Daily Plan/Weekly Plan/Monthly Plan" as needed, then click "Continue job execution".

HyperAI is offering a registration bonus for new users: for just $1, you can get 5 hours of RTX 5090 computing power (originally priced at $2.45), and the resources are valid indefinitely.

4. Wait for resource allocation. The first clone will take approximately 3 minutes. Once the status changes to "Running", click the jump arrow next to "API address" to go to the Demo page.

Effect Demonstration

After entering the Demo running page, upload your test video, enter text in the "Text to Convert" field, and choose from 7 selectable voice timbres in the "Speaker Voice" option. Adjusting the "CFG Scale" controls the intensity of the speech style; a higher value indicates stronger emotion. Finally, click "Generate Speech," and wait a moment for the audio to be generated.

As the year draws to a close, click to play VibeVoice's New Year's greetings for you!

The above is the tutorial recommended by HyperAI this time. Everyone is welcome to come and experience it!

Tutorial Link:

https://go.hyper.ai/jdZrA

Online Tutorial | Microsoft Open Sources VibeVoice, Enabling 90 Minutes of Natural Dialogue Between 4 Roles

7 months ago

Information

Artificial Intelligence

"VibeVoice-Realtime TTS: Real-time Speech Synthesis Service" is now available on the tutorial section of the HyperAI website (hyper.ai). You can deploy and experience it with just one click!

Tutorial Link:

https://go.hyper.ai/jdZrA

Demo Run

1. After entering the hyper.ai homepage, select "VibeVoice-Realtime TTS: Real-time Speech Synthesis Service", or select it from the "Tutorials" page. Then click "Run this tutorial online".

2. After the page redirects, click "Clone" in the upper right corner to clone the tutorial into your own container.

Note: You can switch languages in the upper right corner of the page. Currently, Chinese and English are available. This tutorial will show the steps in English.

3. Select the "NVIDIA GeForce RTX 5090" and "PyTorch" images, and choose "Pay As You Go" or "Daily Plan/Weekly Plan/Monthly Plan" as needed, then click "Continue job execution".

HyperAI is offering a registration bonus for new users: for just $1, you can get 5 hours of RTX 5090 computing power (originally priced at $2.45), and the resources are valid indefinitely.

4. Wait for resource allocation. The first clone will take approximately 3 minutes. Once the status changes to "Running", click the jump arrow next to "API address" to go to the Demo page.

Effect Demonstration

As the year draws to a close, click to play VibeVoice's New Year's greetings for you!

The above is the tutorial recommended by HyperAI this time. Everyone is welcome to come and experience it!

Tutorial Link:

https://go.hyper.ai/jdZrA

Command Palette

Online Tutorial | Microsoft Open Sources VibeVoice, Enabling 90 Minutes of Natural Dialogue Between 4 Roles

Demo Run

Effect Demonstration

Command Palette

Online Tutorial | Microsoft Open Sources VibeVoice, Enabling 90 Minutes of Natural Dialogue Between 4 Roles

Demo Run

Effect Demonstration

Related News

Achieve "voice-over Freedom" With Just 3 Seconds of Audio: Mistral open-source Speech Model Voxtral-4B-TTS-2603; Set a New Benchmark for Data Quality: Sutra 10B Pretraining.

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.

Online Tutorial | Supports 600+ Languages, Xiaomi Open Sources OmniVoice: Achieve Voice Cloning With Just 3-10 Seconds of Reference Audio

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

Fast and Accurate! Cohere Releases open-source Transcription Model; Accurate Parsing of Complex Scenarios: Chandra-ocr-2 Visual Language Model Achieves Precise OCR.

Free CPU Tutorial | Achieving 8.8k Stars, the Supertonic-3 TTS Model Has Only About 99M Parameters and Supports 31 languages.

Online Tutorial | HKU Team Open Sources DeepTutor, a Personal Learning Assistant That Enables Interactive Learning Covering Understanding, Reasoning, and Generation Through Multi-Agent Collaboration

Zero-sampling TTS Breakthrough! A Few Seconds of Reference Audio, OmniVoice Helps You Easily Clone Hundreds of Languages; 17 Languages All in One Go: MDPbench Solves the Major Problem of Parsing low-resource Text systems.

A Locally Runnable Privacy Detection Model: Privacy Filter Achieves high-quality PII Filtering at Low Cost; Hardcore Open Source! Covering the Transfermarkt Structured Football Dataset With Over 80,000 matches.

Command Palette

Online Tutorial | Microsoft Open Sources VibeVoice, Enabling 90 Minutes of Natural Dialogue Between 4 Roles

Demo Run

Effect Demonstration

Related News

Achieve "voice-over Freedom" With Just 3 Seconds of Audio: Mistral open-source Speech Model Voxtral-4B-TTS-2603; Set a New Benchmark for Data Quality: Sutra 10B Pretraining.

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.

Online Tutorial | Supports 600+ Languages, Xiaomi Open Sources OmniVoice: Achieve Voice Cloning With Just 3-10 Seconds of Reference Audio

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

Fast and Accurate! Cohere Releases open-source Transcription Model; Accurate Parsing of Complex Scenarios: Chandra-ocr-2 Visual Language Model Achieves Precise OCR.

Free CPU Tutorial | Achieving 8.8k Stars, the Supertonic-3 TTS Model Has Only About 99M Parameters and Supports 31 languages.

Online Tutorial | HKU Team Open Sources DeepTutor, a Personal Learning Assistant That Enables Interactive Learning Covering Understanding, Reasoning, and Generation Through Multi-Agent Collaboration

Zero-sampling TTS Breakthrough! A Few Seconds of Reference Audio, OmniVoice Helps You Easily Clone Hundreds of Languages; 17 Languages All in One Go: MDPbench Solves the Major Problem of Parsing low-resource Text systems.

A Locally Runnable Privacy Detection Model: Privacy Filter Achieves high-quality PII Filtering at Low Cost; Hardcore Open Source! Covering the Transfermarkt Structured Football Dataset With Over 80,000 matches.

Related News

Achieve "voice-over Freedom" With Just 3 Seconds of Audio: Mistral open-source Speech Model Voxtral-4B-TTS-2603; Set a New Benchmark for Data Quality: Sutra 10B Pretraining.

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.

Online Tutorial | Supports 600+ Languages, Xiaomi Open Sources OmniVoice: Achieve Voice Cloning With Just 3-10 Seconds of Reference Audio

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

Fast and Accurate! Cohere Releases open-source Transcription Model; Accurate Parsing of Complex Scenarios: Chandra-ocr-2 Visual Language Model Achieves Precise OCR.

Free CPU Tutorial | Achieving 8.8k Stars, the Supertonic-3 TTS Model Has Only About 99M Parameters and Supports 31 languages.

Online Tutorial | HKU Team Open Sources DeepTutor, a Personal Learning Assistant That Enables Interactive Learning Covering Understanding, Reasoning, and Generation Through Multi-Agent Collaboration

Zero-sampling TTS Breakthrough! A Few Seconds of Reference Audio, OmniVoice Helps You Easily Clone Hundreds of Languages; 17 Languages All in One Go: MDPbench Solves the Major Problem of Parsing low-resource Text systems.

A Locally Runnable Privacy Detection Model: Privacy Filter Achieves high-quality PII Filtering at Low Cost; Hardcore Open Source! Covering the Transfermarkt Structured Football Dataset With Over 80,000 matches.

Related News

Achieve "voice-over Freedom" With Just 3 Seconds of Audio: Mistral open-source Speech Model Voxtral-4B-TTS-2603; Set a New Benchmark for Data Quality: Sutra 10B Pretraining.

Can Emojis Control Speech Generation? Irodori-TTS Is a Japanese TTS Based on the RF-DiT Architecture; Eczema and Tinea Skin Disease Datasets: Supporting Medical Image Classification and Transfer learning.

Online Tutorial | Supports 600+ Languages, Xiaomi Open Sources OmniVoice: Achieve Voice Cloning With Just 3-10 Seconds of Reference Audio

Tencent open-sources Hy-MT1.5 Translation Model: 440MB Achieves top-tier Translation Capabilities; MIT Jointly Releases MathNet: a Multimodal Mathematical Inference Benchmark Covering 27,000 Real Olympiad Math problems.

Fast and Accurate! Cohere Releases open-source Transcription Model; Accurate Parsing of Complex Scenarios: Chandra-ocr-2 Visual Language Model Achieves Precise OCR.

Free CPU Tutorial | Achieving 8.8k Stars, the Supertonic-3 TTS Model Has Only About 99M Parameters and Supports 31 languages.

Online Tutorial | HKU Team Open Sources DeepTutor, a Personal Learning Assistant That Enables Interactive Learning Covering Understanding, Reasoning, and Generation Through Multi-Agent Collaboration

Zero-sampling TTS Breakthrough! A Few Seconds of Reference Audio, OmniVoice Helps You Easily Clone Hundreds of Languages; 17 Languages All in One Go: MDPbench Solves the Major Problem of Parsing low-resource Text systems.

A Locally Runnable Privacy Detection Model: Privacy Filter Achieves high-quality PII Filtering at Low Cost; Hardcore Open Source! Covering the Transfermarkt Structured Football Dataset With Over 80,000 matches.