OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, Daniel Povey

Abstract

We present OmniVoice, a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel discrete non-autoregressive (NAR) architecture designed in the style of a diffusion language model. Unlike conventional discrete NAR models, which suffer from performance bottlenecks in complex two-stage pipelines (text to semantics to acoustics), OmniVoice maps text directly to multi-codebook acoustic tokens. This simplified approach is made possible by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM for superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre-trained models are publicly available.

One-sentence Summary

OmniVoice is a massive omnilingual zero-shot text-to-speech model scaling to over 600 languages that utilizes a novel diffusion language model-style discrete non-autoregressive architecture to directly map text to multi-codebook acoustic tokens through full-codebook random masking and LLM initialization, achieving state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks.

Key Contributions

  • The paper introduces OmniVoice, a massive multilingual zero-shot text-to-speech model that utilizes a novel single-stage discrete non-autoregressive architecture to directly map text to multi-codebook acoustic tokens.
  • This work implements two key technical innovations, including a full-codebook random masking strategy for training efficiency and the initialization of the model backbone with pre-trained large language models to enhance speech intelligibility.
  • Experiments demonstrate that training on a 581k-hour multilingual dataset enables the model to support over 600 languages and achieve state-of-the-art performance in intelligibility, speaker similarity, and naturalness across Chinese, English, and diverse multilingual benchmarks.

Introduction

Zero-shot text-to-speech (TTS) technology is essential for creating high-quality synthetic voices from minimal audio samples, yet most current models are restricted to a small subset of languages. While existing discrete non-autoregressive (NAR) models offer fast inference, they typically rely on complex two-stage pipelines that suffer from error propagation and information bottlenecks. The authors introduce OmniVoice, a massive multilingual zero-shot TTS model that supports over 600 languages using a streamlined single-stage architecture. By leveraging a full-codebook random masking strategy and initializing the backbone with pre-trained large language model (LLM) weights, the authors enable the model to map text directly to acoustic tokens with superior intelligibility and training efficiency.

Dataset

The authors utilize the following data configurations and benchmarks for OmniVoice:

  • Dataset Composition and Sources: The training data consists of a massive self-built multilingual corpus totaling 581,000 hours across more than 600 languages. This collection integrates a wide variety of sources, including Emilia, LibriTTS, Common Voice, VoxBox, Meta Omnilingual ASR Corpus, FLEURS, GigaSpeech 2, YODAS-Granary, and numerous regional datasets such as IndicVoices-R, WenetSpeech, and various Arabic and Tibetan corpora.
  • Training Configurations: The authors employ two distinct training strategies:
    • Bilingual Variant: This version is trained only on the Chinese and English subsets of the Emilia dataset, to allow a fair comparison with existing state-of-the-art zero-shot TTS models. In this configuration, prompt denoising is omitted to isolate the architectural advantages.
    • Multilingual Variant: This version uses the full 581k-hour multilingual dataset to support extensive language coverage.
  • Evaluation Benchmarks: To assess performance, the authors use four specific benchmarks:
    • LibriSpeech-PC: A standard benchmark for English zero-shot TTS.
    • Seed-TTS: A bilingual benchmark covering Chinese and English.
    • MiniMax-Multilingual-24: A multilingual benchmark spanning 24 languages.
    • FLEURS-Multilingual-102: A benchmark using the dev and test splits of the FLEURS dataset to evaluate 102 languages, representing one of the widest language coverage benchmarks for zero-shot TTS.

Method

The authors propose OmniVoice, a single-stage non-autoregressive (NAR) text-to-speech (TTS) model designed with a diffusion language model-style architecture. Unlike traditional two-stage cascaded pipelines that often suffer from error propagation and information bottlenecks, OmniVoice directly maps text to multi-codebook acoustic tokens in an end-to-end fashion.

The architecture of OmniVoice is designed to process multiple input streams to generate high-fidelity speech. The input consists of a text token sequence $Y$, which is a concatenated sequence of instruct and transcript tokens providing linguistic and task-oriented guidance, and an acoustic token matrix $X \in \mathbb{R}^{T \times C}$, where $T$ represents the number of time steps and $C$ represents the number of codebooks. This acoustic matrix is partitioned along the temporal dimension into a prompt segment $X_{\text{prompt}}$, containing the prefix acoustic context, and a target masked segment $X_{\text{target}}$, where tokens are replaced with a special mask token $[M]$.

[Figure: OmniVoice framework diagram]

The model utilizes a bidirectional Transformer backbone, initialized with pre-trained LLM weights. Text tokens are processed through a text embedding layer, while acoustic tokens are processed via codebook-specific embedding layers. To integrate the multi-codebook information, the embeddings of all $C$ codebooks at the same temporal position are summed into a unified embedding before being fed into the Transformer. On the output side, the model employs $C$ independent, codebook-specific prediction heads that project the final hidden states to output probability distributions over the vocabulary for each corresponding codebook.
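The input-side fusion described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the sizes `C`, `VOCAB`, and `DIM`, the helper names, and the choice to give the mask token its own embedding row are all assumptions made for the example.

```python
import random

def make_table(vocab_size, dim, rng):
    """Create a toy embedding table: one dim-sized vector per token id."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed_frames(tokens, tables):
    """Sum the C codebook-specific embeddings at each time step.

    tokens: T x C nested list of token ids (tokens[t][c]).
    tables: list of C embedding tables, one per codebook.
    Returns a T x dim nested list, one fused vector per time step.
    """
    dim = len(tables[0][0])
    fused = []
    for frame in tokens:
        vec = [0.0] * dim
        for c, token_id in enumerate(frame):
            row = tables[c][token_id]          # lookup in codebook c's own table
            vec = [v + r for v, r in zip(vec, row)]
        fused.append(vec)
    return fused

rng = random.Random(0)
C, VOCAB, DIM = 4, 10, 8                       # hypothetical sizes
MASK_ID = VOCAB                                # assumption: mask token gets an extra row
tables = [make_table(VOCAB + 1, DIM, rng) for _ in range(C)]
tokens = [[3, 1, 4, 1], [5, MASK_ID, 2, 6], [MASK_ID] * C]
fused = embed_frames(tokens, tables)
print(len(fused), len(fused[0]))               # T fused vectors of size DIM
```

Summing keeps the sequence length at $T$ regardless of the number of codebooks, so the Transformer sees one position per time step; the output side would mirror this with one projection per codebook.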

The training process is driven by a discrete diffusion objective. The model is trained to recover the original tokens in the masked positions of $X_{\text{target}}$ by leveraging the text conditions $Y$, the prompt $X_{\text{prompt}}$, and the unmasked tokens. Let $\mathcal{M}$ denote the set of indices $(t, c)$ corresponding to masked positions within the target segment, where $t \in \{T_p + 1, \ldots, T\}$ and $c \in \{1, \ldots, C\}$. The training loss $\mathcal{L}$ is formulated as:

$$\mathcal{L} = - \sum_{(t, c) \in \mathcal{M}} \log P(x_{t,c} \mid X, Y; \theta)$$

where $x_{t,c}$ is the ground-truth acoustic token at time step $t$ and codebook index $c$, and $P(x_{t,c} \mid \ldots; \theta)$ is the probability distribution predicted by the model parameterized by $\theta$.
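As a worked illustration of this loss, the sketch below sums the negative log-probability of the ground-truth token over masked positions only. The function name `masked_nll` and the toy probabilities are hypothetical, and the code uses 0-based indices where the text uses 1-based ones.

```python
import math

def masked_nll(probs, targets, masked):
    """Negative log-likelihood summed over masked (t, c) positions only.

    probs[t][c]   : predicted probabilities over codebook c's vocabulary at step t.
    targets[t][c] : ground-truth token id.
    masked        : set of (t, c) index pairs that were masked.
    """
    loss = 0.0
    for (t, c) in masked:
        loss -= math.log(probs[t][c][targets[t][c]])
    return loss

# Toy example: T = 2 time steps, C = 2 codebooks, vocabulary of size 2.
probs = [
    [[0.5, 0.5], [0.9, 0.1]],
    [[0.25, 0.75], [0.2, 0.8]],
]
targets = [[0, 0], [1, 1]]
masked = {(0, 0), (1, 1)}          # unmasked positions contribute nothing
loss = masked_nll(probs, targets, masked)
print(round(loss, 4))              # equals -log(0.5) - log(0.8)
```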

To enhance training efficiency, the authors move away from conventional per-layer masking schedules, which only optimize a sparse subset of the token matrix per iteration. Instead, OmniVoice adopts a full-codebook random masking strategy. In this approach, a binary mask $m_{i,j} \sim \text{Bernoulli}(p_t)$ is independently sampled for every entry in the $T \times C$ token matrix, with the masking ratio $p_t$ drawn from a uniform distribution $p_t \sim \mathcal{U}(0,1)$ for each instance. This strategy ensures that, on average, 50% of the tokens are used for loss computation, significantly accelerating convergence and improving generative quality.
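The masking rule can be sketched as follows, under stated assumptions: the `MASK` id, the matrix sizes, and the function name are placeholders, and the mask is applied to a matrix standing in for the target segment. The loop at the end checks empirically that the average masked fraction is close to 50%, which follows from the mean of a uniform distribution on (0, 1) being 0.5.

```python
import random

MASK = -1  # hypothetical id standing in for the special [M] token

def full_codebook_random_mask(target, rng):
    """Mask each entry of a T x C token matrix independently.

    A ratio p ~ U(0, 1) is drawn once per training instance, then every
    (t, c) entry is masked with probability p.
    Returns (masked matrix, set of masked (t, c) positions, p).
    """
    p = rng.random()
    masked_positions = set()
    out = []
    for t, frame in enumerate(target):
        new_frame = []
        for c, tok in enumerate(frame):
            if rng.random() < p:
                new_frame.append(MASK)
                masked_positions.add((t, c))
            else:
                new_frame.append(tok)
        out.append(new_frame)
    return out, masked_positions, p

rng = random.Random(0)
T, C = 50, 8                                   # hypothetical sizes
target = [[rng.randrange(1024) for _ in range(C)] for _ in range(T)]

fractions = []
for _ in range(2000):                          # 2000 simulated training instances
    _, positions, _ = full_codebook_random_mask(target, rng)
    fractions.append(len(positions) / (T * C))
mean_fraction = sum(fractions) / len(fractions)
print(round(mean_fraction, 2))                 # should be close to 0.5
```

The returned set of positions plays the role of $\mathcal{M}$: only those entries enter the loss.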

Experiment

The evaluation employs a combination of objective metrics for speaker similarity, intelligibility, and naturalness alongside subjective human assessments to validate the OmniVoice model. Experiments across English, Chinese, and extensive multilingual benchmarks demonstrate that the model achieves commercial-grade performance, showing strong generalization even in low-resource language scenarios. Ablation studies further confirm that key architectural choices, such as full-codebook random masking, LLM initialization, and prompt denoising, are essential for optimizing speech quality and linguistic accuracy. Additionally, the model exhibits high inference efficiency, outperforming existing baselines in real-time generation speed.

The authors conduct a subjective evaluation comparing OmniVoice and its variants against several state-of-the-art TTS models using comparative mean opinion score (CMOS) and similarity mean opinion score (SMOS) metrics. The results demonstrate that OmniVoice achieves superior performance in both relative speech quality and absolute speaker similarity. OmniVoice achieves the highest scores in both CMOS and SMOS among all tested models. OmniVoice-Emilia outperforms existing NAR baselines in subjective quality and similarity. The multilingual OmniVoice model shows competitive advantages in speaker similarity compared to other baseline models.

The authors evaluate the impact of LLM initialization on model intelligibility by comparing it against various random initialization configurations. Results show that models utilizing LLM initialization achieve lower word error rates across all tested datasets compared to those with random initialization. LLM initialization consistently yields better intelligibility than random initialization regardless of the learning rate used, and its advantage persists even after extensive learning rate tuning for the randomly initialized models. The benefit of inheriting linguistic knowledge from pre-trained LLMs is evident across English and Chinese benchmarks.

The authors evaluate OmniVoice on the FLEURS-Multilingual-102 benchmark to assess its multilingual capabilities. The results show that the model achieves high intelligibility and speaker similarity compared to the ground-truth reference. OmniVoice achieves a lower average character error rate than the ground-truth audio. The model demonstrates high intelligibility with a large number of languages meeting strict error rate thresholds. OmniVoice outperforms the ground-truth in the proportion of languages falling below both 5% and 10% character error rate limits.
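For readers unfamiliar with the metric, the sketch below shows one common way to compute character error rate (edit distance divided by reference length) and the share of languages under a threshold. The function names are hypothetical, and the per-language values are made-up placeholders for illustration, not results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)

def share_below(cer_by_language, threshold):
    """Fraction of languages whose CER is strictly below the threshold."""
    values = list(cer_by_language.values())
    return sum(v < threshold for v in values) / len(values)

# Placeholder values only; these are NOT numbers reported by the authors.
toy_cer = {"lang_a": 0.02, "lang_b": 0.07, "lang_c": 0.15, "lang_d": 0.04}
print(cer("hello world", "hallo world"))   # one substitution over 11 characters
print(share_below(toy_cer, 0.05))          # 0.5
print(share_below(toy_cer, 0.10))          # 0.75
```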

The authors evaluate the inference speed of OmniVoice by measuring the real-time factor (RTF) across various inference steps and batch sizes. The results show that increasing the batch size reduces the real-time factor, indicating higher efficiency during batch inference. Larger batch sizes lead to a lower real-time factor across all tested inference steps. Increasing the number of inference steps results in a higher real-time factor. The model demonstrates improved inference efficiency when processing multiple samples simultaneously.
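Under the common definition, the real-time factor is synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below measures it for a batch; `fake_synthesize` is a hypothetical stand-in for a TTS model, and the 24 kHz sample rate is an assumption for the example rather than a detail taken from the paper.

```python
import time

def real_time_factor(synthesize, texts, sample_rate):
    """Wall-clock synthesis time divided by total generated audio duration."""
    start = time.perf_counter()
    waveforms = synthesize(texts)              # one list of samples per input text
    elapsed = time.perf_counter() - start
    audio_seconds = sum(len(w) for w in waveforms) / sample_rate
    return elapsed / audio_seconds

def fake_synthesize(texts):
    """Hypothetical stand-in for a TTS model: 2400 silent samples per character."""
    return [[0.0] * (len(t) * 2400) for t in texts]

SAMPLE_RATE = 24000                            # assumed sample rate in Hz
rtf = real_time_factor(fake_synthesize, ["hello", "batch inference"], SAMPLE_RATE)
print(f"RTF = {rtf:.6f}")
```

Because the whole batch is timed once while the audio durations add up, a model that processes samples in parallel yields a lower RTF as the batch grows, which matches the trend reported above.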

The authors compare OmniVoice against ElevenLabs across various languages using Word Error Rate (WER) and speaker similarity (SIM-o) metrics. The results demonstrate that OmniVoice achieves competitive or superior performance in both intelligibility and speaker similarity across a wide range of languages. OmniVoice achieves a lower average WER compared to ElevenLabs across the evaluated languages. The model shows higher average speaker similarity than the ElevenLabs baseline. OmniVoice maintains strong performance in both intelligibility and similarity across diverse linguistic groups.

The authors conduct a series of subjective and objective evaluations to validate OmniVoice's speech quality, speaker similarity, multilingual capabilities, and inference efficiency. Through comparisons with state-of-the-art models and baselines, the results demonstrate that OmniVoice provides superior intelligibility and similarity across diverse languages, particularly when utilizing LLM initialization to inherit linguistic knowledge. Furthermore, the model exhibits high efficiency during batch inference and maintains competitive performance against commercial benchmarks.

