Voxtral-Small-24B-2507 Speech Understanding Model Demo
1. Tutorial Introduction
Voxtral is an advanced audio model launched by Mistral AI in July 2025. Based on its excellent speech transcription and deep understanding capabilities, it promotes voice as a natural way of human-computer interaction. Voxtral is available in 24B and 3B versions, suitable for production scale and local deployment respectively. Voxtral supports multiple languages, long text context, built-in question and answer and summary functions, and can directly trigger backend function calls. Voxtral's performance surpasses existing open source models and proprietary APIs in multiple benchmarks, while being lower in cost and widely used in various scenarios, helping to popularize voice interaction.
Key features:
- Long text contextual processing: Supports up to 30 minutes of audio transcription and 40 minutes of audio understanding, and can handle complex long-form content.
- Built-in Q&A and summarization: Ask questions directly about the audio content or generate structured summaries without the need for additional ASR and language models.
- Multi-language support: Automatic language detection, support for multiple common languages (such as English, Spanish, French, Portuguese, Hindi, German, etc.) to meet the needs of global users.
- Voice-triggered function calls: Directly trigger backend functions, workflows, or API calls based on user voice intent without the need for intermediate parsing steps.
- Text comprehension capability: The text comprehension capability of Mistral Small 3.1 is retained, supporting text input and processing.
- Optimized transcription performance: Provides highly optimized transcription endpoints that are cost-effective and suitable for large-scale applications.
The computing resources of this tutorial use dual-card RTX A6000, and the model deployed in this tutorial is Voxtral-Small-24B-2507. Two examples, Audio Transcription and Audio Understanding, are provided for testing.
2. Effect display
Audio Transcription

Audio Understanding

3. Operation steps
1. Start the container

2. Usage steps
If "Bad Gateway" is displayed, it means the model is initializing. Since the model is large, please wait about 5-10 minutes and refresh the page.
1. Audio Transcription

2. Audio Understanding

4. Discussion
🖌️ If you see a high-quality project, please leave a message in the background to recommend it! In addition, we have also established a tutorial exchange group. Welcome friends to scan the QR code and remark [SD Tutorial] to join the group to discuss various technical issues and share application effects↓
