HyperAI

Whisper-large-v3-turbo Speech Recognition and Translation Demo

whisper-large-v3-turbo: 8 times faster than large-v3 with almost no loss in quality

1. Tutorial Introduction

Whisper is a general-purpose speech recognition model. It is trained on a large and diverse audio dataset and can perform multiple tasks, such as multilingual speech recognition and speech translation.

  • Multilingual speech recognition: automatically identifies the language spoken in the audio and outputs the transcript in that original language
  • Language translation: building on the recognition result, translates the speech into Chinese (default) for output
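
The two tasks listed above correspond to the `task` option that Whisper accepts at generation time. The following is a minimal sketch, assuming the Hugging Face `transformers` library and the `openai/whisper-large-v3-turbo` checkpoint; this tutorial does not say how the demo is implemented, so both are assumptions, and `build_generate_kwargs` and `run_whisper` are hypothetical helper names. Note that the stock Whisper `translate` task targets English, so a Chinese default would be specific to this demo.

```python
from typing import Optional

VALID_TASKS = ("transcribe", "translate")


def build_generate_kwargs(task: str = "transcribe", language: Optional[str] = None) -> dict:
    """Build the generation options for one Whisper run.

    Leaving `language` as None lets the model detect the spoken language itself.
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    kwargs = {"task": task}
    if language is not None:
        kwargs["language"] = language
    return kwargs


def run_whisper(audio_path: str, task: str = "transcribe", language: Optional[str] = None) -> str:
    """Run speech recognition or translation on a single audio file.

    Assumption: `transformers`, `torch`, and ffmpeg are installed and the
    checkpoint can be downloaded.
    """
    from transformers import pipeline  # imported lazily because it is a heavy dependency

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3-turbo",
        chunk_length_s=30,  # split long recordings into 30-second chunks
    )
    result = asr(audio_path, generate_kwargs=build_generate_kwargs(task, language))
    return result["text"]
```

Calling `run_whisper("sample.wav", task="translate")` (with `sample.wav` standing in for any local audio file) would then return the translated text, while the default `task="transcribe"` keeps the original language.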

At the DevDay event held on October 1, 2024, OpenAI announced the launch of the Whisper large-v3-turbo speech transcription model. It has a total of 809 million parameters and is 8 times faster than large-v3 with almost no loss in quality.

The Whisper large-v3-turbo speech transcription model is an optimized version of large-v3 with only 4 decoder layers, compared to 32 in large-v3. With 809 million parameters, it is slightly larger than the medium model (769 million parameters) but much smaller than the large model (1.55 billion parameters). It requires 6 GB of VRAM, while the large model requires 10 GB.
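
To put the figures above side by side, here is a small arithmetic sketch. It uses only the numbers quoted in this section; the fp16 estimate assumes 2 bytes per parameter and covers the weights alone, so it is expected to be lower than the VRAM figures quoted above, which presumably also include activations and runtime overhead.

```python
# Parameter counts quoted in the text above, in millions.
PARAMS_M = {"medium": 769, "large-v3-turbo": 809, "large": 1550}
# Decoder layer counts quoted in the text above.
DECODER_LAYERS = {"large-v3-turbo": 4, "large-v3": 32}


def ratio(a: float, b: float) -> float:
    """Return a / b rounded to two decimals."""
    return round(a / b, 2)


def fp16_weight_gb(params_millions: float) -> float:
    """Approximate weight storage in GB, assuming 2 bytes per parameter (fp16)."""
    return round(params_millions * 1e6 * 2 / 1e9, 2)


turbo_vs_large = ratio(PARAMS_M["large-v3-turbo"], PARAMS_M["large"])
turbo_vs_medium = ratio(PARAMS_M["large-v3-turbo"], PARAMS_M["medium"])
layer_reduction = DECODER_LAYERS["large-v3"] // DECODER_LAYERS["large-v3-turbo"]

print(f"turbo / large parameters:  {turbo_vs_large}")   # 0.52
print(f"turbo / medium parameters: {turbo_vs_medium}")  # 1.05
print(f"decoder layers reduced by a factor of {layer_reduction}")  # 8
print(f"turbo fp16 weights: ~{fp16_weight_gb(PARAMS_M['large-v3-turbo'])} GB")  # ~1.62 GB
```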

2. Operation Steps

After starting the container, click the API address to open the web interface.

The interface provides three input methods for speech recognition (transcribe) or translation (translate):

  • Microphone: record audio in real time with the device microphone
  • Audio file: upload offline audio
  • YouTube: online video

1. Microphone: real-time recording with the device

Click Microphone (default) and record audio with the device microphone. When the recording is finished, upload it to the platform, select transcription or translation, and then click Submit to generate the corresponding text. (Because of the model's limitations, the translation may be inaccurate.)

Figure 1 Microphone function operation process

2. Audio file: upload offline audio

Click Audio file, upload the audio to be processed or drag it into the interface, select transcription or translation, and then click Submit to generate the corresponding text.

Figure 2 Audio file function operation process

3. YouTube: online video (Due to network problems, the video may not be recognized and several attempts may be needed. The demo is for reference only.)

Open YouTube in the browser and find the video you want. Click Share on the right and a URL will appear. Copy this URL into the YouTube URL text box on the web page, select Transcribe or Translate, and then click Submit to generate the corresponding text.
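
A link copied from the Share dialog usually has the short `youtu.be` form, while the browser address bar shows the `watch?v=` form. The sketch below shows one way to check that a pasted link is a recognizable YouTube URL and to pull out the video ID, using only the Python standard library. `extract_video_id` is a hypothetical helper, the IDs in the usage lines are made up, and the demo's own URL handling is not documented here, so this is an illustration rather than its actual logic.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube share or watch URL, or None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    if host == "youtu.be":
        # Share links look like https://youtu.be/<id>?si=...
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None

    if host in ("youtube.com", "m.youtube.com") and parsed.path == "/watch":
        # Watch links carry the ID in the `v` query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None

    return None


# Usage with made-up IDs:
print(extract_video_id("https://youtu.be/abc123XYZ_-?si=example"))
print(extract_video_id("https://www.youtube.com/watch?v=abc123XYZ_-&t=10s"))
print(extract_video_id("https://example.com/watch?v=abc123XYZ_-"))
```

Rejecting anything that is not a known YouTube host before submitting can save a failed attempt, which matters here because recognition may already need several tries.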

Figure 3 Obtaining YouTube URL

Figure 4 YouTube function operation process
