2달 전

Alexander H. Liu Alexis Tacnet Andy Ehrenberg Andy Lo Chen-Yo Sun Guillaume Lample Henry Lagarde Jean-Malo Delignon Jaeyoung Kim John Harvill

초록

저희는 3 초의 참조 오디오만으로도 자연스러운 음성을 생성할 수 있는 표현력 있는 다국어 텍스트 - 음성 변환 (TTS) 모델인 Voxtral TTS 를 소개합니다. Voxtral TTS 는 의미적 음성 토큰의 자기회귀 (auto-regressive) 생성과 음향 토큰의 플로우 매칭 (flow-matching) 을 결합한 하이브리드 아키텍처를 채택합니다. 이러한 토큰은 처음부터 학습된 하이브리드 VQ-FSQ 양자화 방식을 적용한 Voxtral Codec 을 통해 인코딩 및 디코딩됩니다. 원어민들이 참여한 인간 평가에서 Voxtral TTS 는 자연스러움과 표현력 측면에서 다국어 음성 클로닝에 유리한 것으로 평가되어 ElevenLabs Flash v2.5 대비 68.4% 의 승률을 기록했습니다. 본 모델의 가중치는 CC BY-NC 라이선스 하에 공개됩니다.

One-sentence Summary

Mistral AI introduces Voxtral TTS, a multilingual text-to-speech model that leverages a hybrid architecture combining auto-regressive semantic token generation with flow-matching for acoustic tokens. This approach enables high-fidelity voice cloning from minimal audio references, outperforming ElevenLabs Flash v2.5 in human evaluations for naturalness and expressivity.

Key Contributions

The paper introduces Voxtral TTS, a multilingual zero-shot text-to-speech model that utilizes a hybrid architecture combining auto-regressive generation for semantic tokens with flow-matching for acoustic tokens to balance long-range consistency and rich acoustic detail.
A new speech tokenizer called Voxtral Codec is presented, which is trained from scratch using a hybrid VQ-FSQ quantization scheme to encode reference audio into low-bitrate semantic and acoustic token streams.
Human evaluations demonstrate that the model achieves a 68.4% win rate over ElevenLabs Flash v2.5 in multilingual voice cloning tasks, while the authors release the model weights under a CC BY-NC license.

Introduction

Natural and expressive text-to-speech remains critical for applications like virtual assistants and accessibility tools, yet capturing human speech nuances in zero-shot settings remains difficult. Prior systems often rely on depth-wise autoregressive acoustic generation, which can limit efficiency and the richness of acoustic detail despite using factorized speech representations. The authors introduce Voxtral TTS, a multilingual zero-shot system that combines an autoregressive transformer for semantic tokens with a lightweight flow-matching model for acoustic tokens to improve both consistency and detail. They further adapt Direct Preference Optimization to this hybrid discrete-continuous setting, achieving superior speaker similarity and naturalness compared to leading proprietary models while supporting low-latency streaming.

Method

The authors introduce Voxtral TTS, which relies on a novel speech tokenizer called Voxtral Codec to discretize audio into semantic and acoustic tokens. The Codec architecture is detailed in the figure below. It processes 24 kHz mono waveforms through a convolutional-transformer encoder that projects input patches into embeddings. These embeddings pass through multiple blocks containing causal self-attention and causal CNN layers for downsampling. The resulting latent representation is split into a 256-dimensional semantic component and 36 acoustic dimensions, which are quantized independently. The semantic component uses Vector Quantization (VQ) with a codebook of size 8192, while the acoustic dimensions use Finite Scalar Quantization (FSQ) with 21 uniform levels. This hybrid scheme achieves a low bitrate of approximately 2.14 kbps. The decoder mirrors the encoder structure to reconstruct the waveform from these discrete tokens.

For the generation process, the authors leverage a hybrid architecture combining auto-regressive modeling with flow-matching. Refer to the framework diagram for the overall system layout. The input consists of voice reference audio tokens and text tokens fed into an autoregressive decoder backbone based on Ministral 3B. This backbone auto-regressively generates a sequence of semantic tokens. At each timestep, the hidden state from the backbone is passed to a flow-matching transformer to predict the corresponding acoustic tokens. The flow-matching transformer models the velocity field transporting Gaussian noise to the acoustic embedding space over a series of function evaluation steps. The semantic tokens are predicted via a linear head projecting hidden states to logits over the semantic vocabulary. Finally, the generated semantic and acoustic tokens are decoded into the final waveform by the Voxtral Codec decoder.

Training involves optimizing the codec with a combination of reconstruction losses, adversarial losses, and an auxiliary ASR distillation loss to align semantic tokens with text content. The TTS model is trained using a two-part loss function consisting of a cross-entropy loss on the semantic tokens and a flow-matching loss on the acoustic tokens. The flow-matching objective minimizes the difference between the predicted velocity field and the conditional velocity target derived from the data distribution and Gaussian noise.

Experiment

Voxtral Codec outperforms Mimi across all objective metrics and matches or exceeds it in subjective speech quality when configured with similar bitrates.
Automatic evaluations show Voxtral TTS significantly surpasses ElevenLabs models in speaker similarity, though ElevenLabs Flash v2.5 leads in some automated intelligibility and quality scores.
Human evaluations reveal that while automated metrics like UTMOS poorly correlate with human preference, Voxtral TTS consistently outperforms competitors in implicit emotion steering and zero-shot voice cloning across diverse languages.
Direct Preference Optimization (DPO) training reduces hallucinations and volume tapering while improving intelligibility and quality, with minimal impact on speaker similarity.
Increasing inference steps to 8 and tuning CFG parameters balances quality and emotion adherence, with lower CFG values preferred for high-quality audio and higher values for in-the-wild recordings.
CUDA graph acceleration reduces latency by 47% and improves real-time factor by 2.5x, enabling high-throughput serving that supports over 30 concurrent users with sub-second latency on a single GPU.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

2달 전

Alexander H. Liu Alexis Tacnet Andy Ehrenberg Andy Lo Chen-Yo Sun Guillaume Lample Henry Lagarde Jean-Malo Delignon Jaeyoung Kim John Harvill

초록

One-sentence Summary

Key Contributions

The paper introduces Voxtral TTS, a multilingual zero-shot text-to-speech model that utilizes a hybrid architecture combining auto-regressive generation for semantic tokens with flow-matching for acoustic tokens to balance long-range consistency and rich acoustic detail.
A new speech tokenizer called Voxtral Codec is presented, which is trained from scratch using a hybrid VQ-FSQ quantization scheme to encode reference audio into low-bitrate semantic and acoustic token streams.
Human evaluations demonstrate that the model achieves a 68.4% win rate over ElevenLabs Flash v2.5 in multilingual voice cloning tasks, while the authors release the model weights under a CC BY-NC license.

Introduction

Method

Experiment

Voxtral Codec outperforms Mimi across all objective metrics and matches or exceeds it in subjective speech quality when configured with similar bitrates.
Automatic evaluations show Voxtral TTS significantly surpasses ElevenLabs models in speaker similarity, though ElevenLabs Flash v2.5 leads in some automated intelligibility and quality scores.
Human evaluations reveal that while automated metrics like UTMOS poorly correlate with human preference, Voxtral TTS consistently outperforms competitors in implicit emotion steering and zero-shot voice cloning across diverse languages.
Direct Preference Optimization (DPO) training reduces hallucinations and volume tapering while improving intelligibility and quality, with minimal impact on speaker similarity.
Increasing inference steps to 8 and tuning CFG parameters balances quality and emotion adherence, with lower CFG values preferred for high-quality audio and higher values for in-the-wild recordings.
CUDA graph acceleration reduces latency by 47% and improves real-time factor by 2.5x, enabling high-throughput serving that supports over 30 concurrent users with sub-second latency on a single GPU.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

Voxtral TTS

Alexander H. Liu Alexis Tacnet Andy Ehrenberg Andy Lo Chen-Yo Sun Guillaume Lample Henry Lagarde Jean-Malo Delignon Jaeyoung Kim John Harvill177 more

초록

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

Voxtral TTS

Alexander H. Liu Alexis Tacnet Andy Ehrenberg Andy Lo Chen-Yo Sun Guillaume Lample Henry Lagarde Jean-Malo Delignon Jaeyoung Kim John Harvill177 more

초록

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

Voxtral TTS

Alexander H. Liu Alexis Tacnet Andy Ehrenberg Andy Lo Chen-Yo Sun Guillaume Lample Henry Lagarde Jean-Malo Delignon Jaeyoung Kim John Harvill177 more

초록

One-sentence Summary

Key Contributions

Introduction

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Alexander H. Liu Alexis Tacnet Andy Ehrenberg Andy Lo Chen-Yo Sun Guillaume Lample Henry Lagarde Jean-Malo Delignon Jaeyoung Kim John Harvill

Alexander H. Liu Alexis Tacnet Andy Ehrenberg Andy Lo Chen-Yo Sun Guillaume Lample Henry Lagarde Jean-Malo Delignon Jaeyoung Kim John Harvill

Alexander H. Liu Alexis Tacnet Andy Ehrenberg Andy Lo Chen-Yo Sun Guillaume Lample Henry Lagarde Jean-Malo Delignon Jaeyoung Kim John Harvill