Online Tutorial | VibeVoice-1.5B's Unique Dual-Tokenizer Architecture Enables the Generation of a 90-Minute Conversation Between Four People, Redefining the Boundaries of TTS Technology

Microsoft's latest open-source VibeVoice-1.5B model has caused a sensation in the field of TTS technology. This model, with 1.5 billion parameters, can generate up to 90 minutes of highly natural speech at a time and support simulating conversations with up to four different speakers. Its official blind test MOS (mean opinion score) is as high as 4.5, which is close to the quality of real human voice.

The core innovation of VibeVoice-1.5B lies in its dual-tokenizer architecture and diffusion-based decoding. Built on the Qwen2.5 language model, it uses an acoustic tokenizer (a σ-VAE architecture that achieves 3,200x audio compression) and a semantic tokenizer (focused on preserving textual sentiment and pauses) to process audio sequences at an ultra-low frame rate of just 7.5 Hz. On the decoding side, a 123-million-parameter diffusion decoder, coupled with the DPM-Solver algorithm, reconstructs high-fidelity audio details.
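To make the figures above concrete, here is a minimal back-of-the-envelope sketch in Python. It uses only the numbers quoted in this article (7.5 Hz, 3,200x, 90 minutes). The interpretation that "3,200x compression" means one latent frame per 3,200 waveform samples is an assumption made for illustration, and the variable names are hypothetical.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the article.
FRAME_RATE_HZ = 7.5        # tokenizer frame rate (from the article)
COMPRESSION_RATIO = 3200   # acoustic tokenizer compression (from the article)
MAX_MINUTES = 90           # maximum generation length (from the article)

# Assumption: "3,200x compression" means one latent frame
# per 3,200 waveform samples.
implied_sample_rate_hz = FRAME_RATE_HZ * COMPRESSION_RATIO

# Number of latent frames the language model has to handle
# for a clip of maximum length.
frames_for_max_clip = int(MAX_MINUTES * 60 * FRAME_RATE_HZ)

print(f"Implied input sample rate: {implied_sample_rate_hz:.0f} Hz")
print(f"Frames for {MAX_MINUTES} minutes: {frames_for_max_clip}")
```

Under that assumption, a full 90-minute conversation amounts to roughly forty thousand frames, which suggests why such a low frame rate matters for fitting long audio into a language model's context.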

VibeVoice-1.5B is primarily targeted at the research and developer communities, providing new tools for podcast production, conversational AI, and voice content generation. Note, however, that it currently supports only Chinese and English, and it cannot handle overlapping speech or generate background sound effects. Microsoft explicitly emphasizes that the model is intended for research use, and it embeds an audible disclaimer and an imperceptible watermark to prevent misuse.

The tutorial "Microsoft VibeVoice-1.5B: Redefining the Boundaries of TTS Technology" is now available in the "Tutorial" section of HyperAI's official website. Click the link below to deploy it with one click.

Tutorial Link:

https://go.hyper.ai/6Ii8l

HyperAI exclusive invitation link (copy and open in browser):

https://openbayes.com/console/signup?r=Ada0322_NR0n

Demo Run

1. On the hyper.ai homepage, open the "Tutorials" page, select "Microsoft VibeVoice-1.5B: Redefining the Boundaries of TTS Technology", and click "Run this Tutorial Online."

2. After the page redirects, click "Clone" in the upper right corner to clone the tutorial into your own container.

3. Select "NVIDIA GeForce RTX 4090" and the "PyTorch" image. The OpenBayes platform offers four billing options ("Pay as you go", "Daily", "Weekly", and "Monthly"); choose the one that suits your needs, then click "Continue." New users can register using the invitation link below to receive 4 hours of free RTX 4090 time and 5 hours of free CPU time!

HyperAI exclusive invitation link (copy and open in browser):

https://openbayes.com/console/signup?r=Ada0322_NR0n

4. Wait for resources to be allocated. The first clone takes about 2 minutes. When the status changes to "Running", click the arrow next to "API Address" to open the demo page. Please note that users must complete real-name authentication before they can access the API address.

Effect Demonstration

After entering the model page, select the number of speakers in "Number of Speakers", set the speakers in "Speaker 1-4", enter the conversation text in "Conversation Script", and finally click "Generate Podcast".

Taking a four-person conversation as an example, the author generated speech from the following prompt:

Prompt:

Speaker 1: How about trying that new café this weekend? I heard their pour-over coffee is good.

Speaker 2: Sure! But I have to go to yoga on Saturday afternoon, so I'm free on Sunday morning.

Speaker 3: Sunday morning works for me too. I just want to talk to you guys about the team building next week.

Speaker 4: Then I have no problem! Let's meet at the café entrance at 10 am on Sunday?

Speaker 1: Great, I'll reserve a window seat in advance.
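Before pasting a long script into the "Conversation Script" box, it can help to check that every line follows the "Speaker N: text" pattern used in the prompt above. The following is a minimal sketch of such a pre-check. The helper `parse_script` and its validation rules are hypothetical and are not part of VibeVoice or the demo page, whose actual parsing rules are not documented in this article. The four-speaker limit is the one stated earlier.

```python
import re

MAX_SPEAKERS = 4  # limit stated in the article

# Matches lines such as "Speaker 2: Sure!" (format taken from the prompt above).
LINE_PATTERN = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")


def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a conversation script into (speaker_number, utterance) turns."""
    turns = []
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines between turns are allowed
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"Line does not start with 'Speaker N:': {line!r}")
        speaker = int(match.group(1))
        if not 1 <= speaker <= MAX_SPEAKERS:
            raise ValueError(f"Speaker {speaker} is outside 1-{MAX_SPEAKERS}")
        turns.append((speaker, match.group(2).strip()))
    return turns


example = """
Speaker 1: How about trying that new café this weekend?

Speaker 2: Sure! But I'm only free on Sunday morning.
"""
turns = parse_script(example)
print(len(turns), sorted({speaker for speaker, _ in turns}))
```

The function returns the turns in order, so the set of distinct speaker numbers can be compared with the value chosen in "Number of Speakers" before clicking "Generate Podcast".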

That is the recommended tutorial for this issue. Everyone is welcome to try it out ⬇️

Tutorial Link: https://go.hyper.ai/6Ii8l

HyperAI

10 months ago

Information

Artificial Intelligence
