1. Tutorial Introduction

Voxtral is an advanced audio model launched by Mistral AI in July 2025. Based on its excellent speech transcription and deep understanding capabilities, it promotes voice as a natural way of human-computer interaction. Voxtral is available in 24B and 3B versions, suitable for production scale and local deployment respectively. Voxtral supports multiple languages, long text context, built-in question and answer and summary functions, and can directly trigger backend function calls. Voxtral's performance surpasses existing open source models and proprietary APIs in multiple benchmarks, while being lower in cost and widely used in various scenarios, helping to popularize voice interaction.

Key features:

Long text contextual processing: Supports up to 30 minutes of audio transcription and 40 minutes of audio understanding, and can handle complex long-form content.

Built-in Q&A and summarization: Ask questions directly about the audio content or generate structured summaries without the need for additional ASR and language models.

Multi-language support: Automatic language detection, support for multiple common languages (such as English, Spanish, French, Portuguese, Hindi, German, etc.) to meet the needs of global users.

Voice-triggered function calls: Directly trigger backend functions, workflows, or API calls based on user voice intent without the need for intermediate parsing steps.

Text comprehension capability: The text comprehension capability of Mistral Small 3.1 is retained, supporting text input and processing.

Optimized transcription performance: Provides highly optimized transcription endpoints that are cost-effective and suitable for large-scale applications.

The computing resources of this tutorial use dual-card RTX A6000, and the model deployed in this tutorial is Voxtral-Small-24B-2507. Two functions, Audio Transcription and Audio Understanding, are provided for testing.

4. Discussion

🖌️ If you see a high-quality project, please leave a message in the background to recommend it! In addition, we have also established a tutorial exchange group. Welcome friends to scan the QR code and remark [SD Tutorial] to join the group to discuss various technical issues and share application effects↓

HyperAI

Run this Notebook

Date

7 months ago

Size

3.31 MB

License

Apache 2.0

1. Tutorial Introduction

Key features:

Long text contextual processing: Supports up to 30 minutes of audio transcription and 40 minutes of audio understanding, and can handle complex long-form content.
Built-in Q&A and summarization: Ask questions directly about the audio content or generate structured summaries without the need for additional ASR and language models.
Multi-language support: Automatic language detection, support for multiple common languages (such as English, Spanish, French, Portuguese, Hindi, German, etc.) to meet the needs of global users.
Voice-triggered function calls: Directly trigger backend functions, workflows, or API calls based on user voice intent without the need for intermediate parsing steps.
Text comprehension capability: The text comprehension capability of Mistral Small 3.1 is retained, supporting text input and processing.
Optimized transcription performance: Provides highly optimized transcription endpoints that are cost-effective and suitable for large-scale applications.

The computing resources of this tutorial use dual-card RTX A6000, and the model deployed in this tutorial is Voxtral-Small-24B-2507. Two functions, Audio Transcription and Audio Understanding, are provided for testing.

Appendix: One-click deployment of 3B Voxtral model Demo

2. Effect display

Audio Transcription

Audio Understanding

3. Operation steps

1. Start the container

2. Usage steps

If "Bad Gateway" is displayed, it means the model is initializing. Since the model is large, please wait about 5-10 minutes and refresh the page.

1. Audio Transcription

2. Audio Understanding

4. Discussion

This notebook is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.

Related Notebooks

SoulX-Podcast: Podcast-quality long-text Speech Generation for Multiple dialects.

2 months ago

Krea-realtime-video: Real-time Video Generation Model

3 months ago

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

1. Tutorial Introduction

Key features:

Long text contextual processing: Supports up to 30 minutes of audio transcription and 40 minutes of audio understanding, and can handle complex long-form content.

Built-in Q&A and summarization: Ask questions directly about the audio content or generate structured summaries without the need for additional ASR and language models.

Multi-language support: Automatic language detection, support for multiple common languages (such as English, Spanish, French, Portuguese, Hindi, German, etc.) to meet the needs of global users.

Voice-triggered function calls: Directly trigger backend functions, workflows, or API calls based on user voice intent without the need for intermediate parsing steps.

Text comprehension capability: The text comprehension capability of Mistral Small 3.1 is retained, supporting text input and processing.

Optimized transcription performance: Provides highly optimized transcription endpoints that are cost-effective and suitable for large-scale applications.

4. Discussion

Command Palette

Voxtral-Small-24B-2507 Speech Understanding Model Demo

1. Tutorial Introduction