HyperAI

Parakeet-tdt-0.6b-v2 Speech Recognition

1. Tutorial Introduction

This tutorial runs on a single RTX 4090 GPU. The model supports English speech recognition only.

parakeet-tdt-0.6b-v2 is a 600-million-parameter automatic speech recognition (ASR) model released by NVIDIA NeMo on May 1, 2025, as the latest version in the Parakeet series. It combines a FastConformer encoder with a TDT (Token-and-Duration Transducer) decoder and can transcribe English audio clips of up to 24 minutes in a single pass. The model targets high-accuracy, low-latency English transcription and suits real-time speech-to-text scenarios such as customer service conversations, meeting records, and voice assistants. The related paper is "Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition".
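Because a single pass covers at most 24 minutes of audio, longer recordings need to be split before transcription. The following is a minimal sketch of that idea, not the code this tutorial deploys. `plan_chunks` and `transcribe_files` are hypothetical helper names. The model ID `nvidia/parakeet-tdt-0.6b-v2` and the `ASRModel.from_pretrained` / `transcribe` calls are assumptions based on the usage shown on the public model card, and they require `nemo_toolkit[asr]` to be installed.

```python
from typing import List, Tuple

# Per-pass limit quoted in this tutorial (24 minutes), in seconds.
MAX_CHUNK_SECONDS = 24 * 60


def plan_chunks(total_seconds: float,
                max_seconds: float = MAX_CHUNK_SECONDS) -> List[Tuple[float, float]]:
    """Split [0, total_seconds) into consecutive (start, end) windows
    that are each no longer than max_seconds."""
    if total_seconds < 0 or max_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_seconds must be > 0")
    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks


def transcribe_files(paths: List[str]) -> List[str]:
    """Transcribe audio files with NeMo and return one text per file.

    Assumption: API usage follows the public model card. NeMo is imported
    lazily so that plan_chunks works without it installed.
    """
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v2"
    )
    return [result.text for result in model.transcribe(paths)]
```

For a one-hour recording, `plan_chunks(3600)` yields three windows. Cutting at fixed time boundaries can split a word in half, so in practice it may be better to move each boundary to a nearby pause.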

2. Operation Steps

1. Start the Container

If the page shows "Bad Gateway", the model is still initializing. Because the model is large, wait about 1-2 minutes and then refresh the page.
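The wait-and-refresh step can also be automated. The sketch below polls a URL until it stops answering with HTTP 502 (Bad Gateway). It is an illustration under stated assumptions: `fetch_status` and `wait_until_ready` are hypothetical helpers, the container URL is whatever address your running container exposes, and the defaults (12 attempts, 10 seconds apart) are chosen only to cover roughly the two-minute window mentioned above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, including error codes such as 502.

    Connection-level failures (urllib.error.URLError) are not handled here
    and will propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def wait_until_ready(fetch: Callable[[], int],
                     attempts: int = 12,
                     delay: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call fetch() until it returns something other than 502.

    Returns True if the first non-502 status is a 2xx success, and False
    if it is any other status or if every attempt returned 502.
    """
    for attempt in range(attempts):
        status = fetch()
        if status != 502:
            return 200 <= status < 300
        if attempt < attempts - 1:
            sleep(delay)  # model is still loading; wait before retrying
    return False
```

A typical call would be `wait_until_ready(lambda: fetch_status(url))`, where `url` is the address of your container. Passing `fetch` and `sleep` as parameters keeps the retry logic testable without a network connection.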

2. Usage Demonstration

When using the Safari browser, audio may not play directly.

In addition to uploaded audio files, this tutorial also accepts direct voice input.

Recognition results can be saved as CSV files.
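As an illustration of what such an export might look like, the sketch below writes timestamped segments to CSV text with Python's standard `csv` module. The column names (`start_s`, `end_s`, `text`) and the helper `segments_to_csv` are assumptions made for this example. The demo's actual CSV layout is not documented here and may differ.

```python
import csv
import io
from typing import Iterable, Tuple

# One recognized segment: (start time in seconds, end time in seconds, text).
Segment = Tuple[float, float, str]


def segments_to_csv(segments: Iterable[Segment]) -> str:
    """Serialize segments to CSV text with a header row.

    The csv module quotes fields that contain commas or quotes,
    so punctuation in the transcript does not break the columns.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["start_s", "end_s", "text"])  # hypothetical column names
    for start, end, text in segments:
        writer.writerow([f"{start:.2f}", f"{end:.2f}", text])
    return buffer.getvalue()
```

To save the result to disk, open the target file with `newline=""` and write the returned string, which avoids doubled line endings on Windows.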

3. Discussion

🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial discussion group. Scan the QR code and add the note [SD Tutorial] to join the group, discuss technical issues, and share your results.

Project Support

Thanks to GitHub user SuperYang for deploying this tutorial.