HyperAI

Online Tutorial | VibeVoice-1.5B's Unique Dual-Tokenizer Architecture Enables the Generation of a 90-Minute Conversation Between Four People, Redefining the Boundaries of TTS Technology


Microsoft's recently open-sourced VibeVoice-1.5B model has drawn wide attention in the TTS field. With 1.5 billion parameters, the model can generate up to 90 minutes of highly natural speech in a single pass and supports conversations among up to four distinct speakers. Its officially reported blind-test MOS (mean opinion score) reaches 4.5, close to the quality of a real human voice.

The core innovation of VibeVoice-1.5B lies in its dual-tokenizer architecture and diffusion-based decoding. Built on the Qwen2.5 language model, it pairs an acoustic tokenizer (a σ-VAE architecture that achieves 3,200x audio compression) with a semantic tokenizer (focused on preserving the sentiment and pauses of the text), so that audio sequences are processed at an ultra-low frame rate of just 7.5 Hz. On the decoding side, a 123-million-parameter diffusion decoder, coupled with the DPM-Solver algorithm, reconstructs high-fidelity audio detail.
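To see how the compression ratio, frame rate, and 90-minute figure fit together, here is a small back-of-the-envelope sketch. It assumes a 24 kHz audio sampling rate, which this article does not state; under that assumption, 3,200x compression gives exactly the 7.5 Hz frame rate quoted above. The function and constant names are illustrative only.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the article.
# ASSUMPTION: 24 kHz sampling rate (not stated in the article).
SAMPLE_RATE_HZ = 24_000
COMPRESSION_RATIO = 3_200  # acoustic tokenizer compression, from the article

# Frames per second after compression: 24,000 / 3,200 = 7.5
FRAME_RATE_HZ = SAMPLE_RATE_HZ / COMPRESSION_RATIO


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of acoustic frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


def samples_for(minutes: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of raw audio samples in `minutes` of audio."""
    return round(minutes * 60 * sample_rate_hz)


if __name__ == "__main__":
    print(f"frame rate: {FRAME_RATE_HZ} Hz")
    print(f"frames for 90 min: {frames_for(90):,}")
    print(f"raw samples for 90 min: {samples_for(90):,}")
```

Under these assumptions, a 90-minute session is about 40,500 frames rather than roughly 130 million raw samples, which helps explain why such long sequences are tractable for a language-model backbone.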

VibeVoice-1.5B is primarily targeted at the research and developer communities, providing new tools for podcast production, conversational AI, and voice content generation. However, it's important to note that it currently only supports Chinese and English and cannot handle overlapping speech or generate background sound effects. Microsoft explicitly emphasizes its research use and includes an audible disclaimer and imperceptible watermarking technology to prevent misuse.

The tutorial "Microsoft VibeVoice-1.5B: Redefining the Boundaries of TTS Technology" is now available in the "Tutorials" section of HyperAI's official website. Click the link below to deploy it with one click.

Tutorial Link:

https://go.hyper.ai/6Ii8l

HyperAI exclusive invitation link (copy and open in browser):

https://openbayes.com/console/signup?r=Ada0322_NR0n

Demo Run

1. On the hyper.ai homepage, open the "Tutorials" page, choose "Microsoft VibeVoice-1.5B: Redefining the Boundaries of TTS Technology", and click "Run this Tutorial Online".

2. After the page redirects, click "Clone" in the upper right corner to clone the tutorial into your own container.

3. Select "NVIDIA GeForce RTX 4090" and the "PyTorch" image. The OpenBayes platform offers four billing options, "Pay as you go" and "Daily/Weekly/Monthly" plans; pick the one that suits your needs, then click "Continue." New users who register through the invitation link below receive 4 hours of free RTX 4090 time and 5 hours of free CPU time.

HyperAI exclusive invitation link (copy and open in browser):

https://openbayes.com/console/signup?r=Ada0322_NR0n

4. Wait for resources to be allocated. The first clone takes about 2 minutes. When the status changes to "Running", click the arrow next to "API Address" to open the demo page. Note that users must complete real-name authentication before they can access the API address.

Effect Demonstration

After entering the model page, select the number of speakers in "Number of Speakers", set the speakers in "Speaker 1-4", enter the conversation text in "Conversation Script", and finally click "Generate Podcast".

Taking a four-person conversation as an example, the author generated a speech sample from the following prompt.

Prompt:

Speaker 1: How about trying that new café this weekend? I heard their pour-over coffee is good.

Speaker 2: Sure! But I have to go to yoga on Saturday afternoon, so I'm free on Sunday morning.

Speaker 3: Sunday morning works for me too. I just want to talk to you guys about the team building next week.

Speaker 4: Then I have no problem! Let's meet at the café entrance at 10 am on Sunday?

Speaker 1: Great, I'll reserve a window seat in advance.
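Before pasting a long script into "Conversation Script", it can help to check that every line follows the "Speaker N: text" pattern shown above and that no more than four speakers appear. The sketch below is a hypothetical helper written for this article; `parse_script` and `MAX_SPEAKERS` are illustrative names and are not part of VibeVoice or the demo page, and the exact format the demo accepts may be more permissive.

```python
import re

MAX_SPEAKERS = 4  # limit stated in the article

# Matches lines such as "Speaker 2: Sure! But I have to go to yoga..."
_TURN = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")

SAMPLE_SCRIPT = """\
Speaker 1: How about trying that new café this weekend? I heard their pour-over coffee is good.
Speaker 2: Sure! But I have to go to yoga on Saturday afternoon, so I'm free on Sunday morning.
Speaker 3: Sunday morning works for me too. I just want to talk to you guys about the team building next week.
Speaker 4: Then I have no problem! Let's meet at the café entrance at 10 am on Sunday?
Speaker 1: Great, I'll reserve a window seat in advance.
"""


def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a script into (speaker_number, text) turns, skipping blank lines."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _TURN.match(line)
        if match is None:
            raise ValueError(f"Line does not look like 'Speaker N: text': {line!r}")
        turns.append((int(match.group(1)), match.group(2).strip()))

    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers found; at most {MAX_SPEAKERS} are supported")
    return turns


if __name__ == "__main__":
    for speaker, text in parse_script(SAMPLE_SCRIPT):
        print(speaker, text[:40])
```

Running a check like this locally catches malformed lines or a fifth speaker before any GPU time is spent on generation.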

That wraps up this issue's recommended tutorial. Give it a try yourself ⬇️

Tutorial Link: https://go.hyper.ai/6Ii8l

