MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He
Published: May 14, 2025
Abstract

We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text-to-voice (T2V) by synthesizing timbre features directly from a text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.
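The central idea in the abstract — a learnable speaker encoder that pools a variable-length, untranscribed reference recording into a fixed-size timbre vector, which then conditions an autoregressive decoder — can be illustrated with a toy sketch. This is not the paper's architecture; the function names, dimensions, and the mean-pooling-plus-projection encoder are illustrative assumptions, and the "decoder" is a stand-in loop that merely shows how a global timbre condition would be consumed at every step.

```python
import numpy as np

def speaker_embedding(ref_frames: np.ndarray, dim: int = 4) -> np.ndarray:
    """Toy speaker encoder: pool variable-length reference frames (T, F)
    into a fixed-size, unit-norm timbre vector. Note that no transcription
    of the reference audio is used -- only its acoustic frames."""
    rng = np.random.default_rng(0)                 # fixed random projection (toy)
    proj = rng.standard_normal((ref_frames.shape[1], dim))
    pooled = ref_frames.mean(axis=0) @ proj        # temporal mean-pooling, then project
    return pooled / (np.linalg.norm(pooled) + 1e-8)

def conditioned_decode(text_tokens: list, timbre: np.ndarray, steps: int = 3) -> list:
    """Toy autoregressive loop: every step sees the same global timbre
    vector alongside the text, so the output tracks the reference voice."""
    outputs, state = [], timbre.copy()
    for t in range(steps):
        state = np.tanh(state + 0.1 * len(text_tokens) + 0.01 * t)
        outputs.append(float(state.mean()))
    return outputs

# Usage: any reference length (here 50 frames of 8 features) yields a dim-4 vector.
ref = np.random.default_rng(1).standard_normal((50, 8))
timbre = speaker_embedding(ref)
acoustic = conditioned_decode(["hello", "world"], timbre)
```

Because the timbre vector is a fixed-size, disentangled condition rather than part of the text prompt, it can be manipulated independently of the base model, which is what makes extensions like T2V (predicting the vector from a description) or PVC (fine-tuning the vector on extra data) possible without retraining the decoder.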