Mistral AI Unveils Voxtral Transcribe 2: Next-Gen Speech-to-Text Models with Ultra-Low Latency, Multilingual Support, and Open Weights
Mistral AI has launched Voxtral Transcribe 2, a new generation of speech-to-text models that delivers state-of-the-art transcription quality, speaker diarization, and ultra-low latency. The release includes two specialized models: Voxtral Mini Transcribe V2 for batch processing and Voxtral Realtime for live applications. Voxtral Realtime is released under the Apache 2.0 license, with full model weights available on Hugging Face.

Voxtral Realtime is engineered for real-time use cases where speed is critical. Rather than processing audio in chunks, as traditional methods do, the model uses a novel streaming architecture that transcribes audio as it arrives. Its transcription delay is configurable down to under 200 ms, which opens up use cases such as voice agents, live captioning, and interactive voice interfaces. At a 2.4-second delay, it matches the accuracy of Voxtral Mini Transcribe V2, which suits subtitling. At 480 ms, its word error rate stays within 1-2% of offline models, giving near-offline accuracy with real-time responsiveness. The model is natively multilingual and supports 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. At 4B parameters, it runs efficiently on edge devices, which improves privacy and security for sensitive applications.

Voxtral Mini Transcribe V2 targets batch transcription. It achieves a 4% word error rate on the FLEURS benchmark and is more accurate than GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova. It processes audio up to three times faster than ElevenLabs' Scribe v2 at a significantly lower cost, just $0.003 per minute.
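
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model's output, divided by the number of words in the reference. The Python sketch below is illustrative only: the function name and the lowercase, whitespace-split normalization are assumptions made here, and published benchmarks such as FLEURS typically apply their own text normalization, so it will not reproduce the figures quoted in this article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace. Real evaluations usually normalize punctuation and numbers too.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a 4% WER corresponds to roughly four word-level errors per hundred reference words.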

Voxtral Mini Transcribe V2 includes enterprise-grade features such as speaker diarization with precise timestamps, context biasing to improve recognition of names and technical terms, word-level timestamps for content alignment, and improved robustness in noisy environments. It accepts audio files up to three hours long in a single request and performs strongly on non-English audio.

Both models support the same 13 languages and are designed for GDPR and HIPAA compliance, enabling secure on-premise or private cloud deployments.

A new audio playground in Mistral Studio lets users test the models immediately by uploading audio files of up to 1 GB in formats such as MP3, WAV, M4A, FLAC, and OGG. Users can enable diarization, adjust timestamp precision, and supply context bias terms for domain-specific vocabulary.

Voxtral Mini Transcribe V2 is available now via API at $0.003 per minute. Voxtral Realtime is available via API at $0.006 per minute and as open weights on Hugging Face. Developers can try both models in Mistral Studio or Le Chat.

Mistral AI is also hiring for its speech AI work, building models and tools for developers worldwide.
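
To make the pricing concrete, here is a small Python sketch that estimates API cost from the per-minute rates quoted above and checks a batch job against the three-hour per-request limit. It is only a rough estimate: the dictionary keys are hypothetical labels rather than official API model identifiers, and pro-rata billing by the second is an assumption, since the announcement does not describe billing granularity.

```python
import math
from decimal import Decimal

# Per-minute prices quoted in the announcement. The keys are hypothetical
# labels chosen for this sketch, not official API model identifiers.
PRICE_PER_MINUTE_USD = {
    "mini-transcribe-v2": Decimal("0.003"),
    "realtime": Decimal("0.006"),
}

# Voxtral Mini Transcribe V2 accepts up to three hours of audio per request.
MAX_BATCH_SECONDS = 3 * 60 * 60


def estimate_cost_usd(model: str, duration_seconds: float) -> Decimal:
    """Estimate cost, assuming pro-rata billing by the second (an assumption)."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    return minutes * PRICE_PER_MINUTE_USD[model]


def batch_requests_needed(duration_seconds: float) -> int:
    """Number of requests needed if long audio is split at the three-hour limit."""
    return max(1, math.ceil(duration_seconds / MAX_BATCH_SECONDS))


if __name__ == "__main__":
    ten_hours = 10 * 60 * 60
    print("Batch cost (USD):", estimate_cost_usd("mini-transcribe-v2", ten_hours))
    print("Realtime cost (USD):", estimate_cost_usd("realtime", ten_hours))
    print("Batch requests needed:", batch_requests_needed(ten_hours))
```

Decimal is used instead of float so that small per-minute prices do not accumulate rounding error across long recordings.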
