MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text-to-voice (T2V) by synthesizing timbre features directly from a text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.