
Open ASR Leaderboard Reveals Trends in Multilingual and Long-Form Speech Recognition with Conformer-LLM Models Leading Accuracy and Efficiency Gains

As of November 21, 2025, the Open ASR Leaderboard has become a central benchmark for evaluating open- and closed-source automatic speech recognition (ASR) models across accuracy, efficiency, multilingual capability, and long-form transcription. With over 150 Audio-Text-to-Text models and 27,000 ASR models available on the Hugging Face Hub, selecting the right model is increasingly complex. Most traditional benchmarks focus narrowly on short-form English transcription under 30 seconds, leaving critical areas such as multilingual performance and real-time throughput underexplored. The updated Open ASR Leaderboard addresses these gaps with dedicated tracks for multilingual and long-form audio, evaluating more than 60 models from 18 organizations across 11 datasets for a more comprehensive view of model capabilities.

Several key trends emerge from the latest analysis. Conformer encoders paired with large language model (LLM) decoders lead in English transcription accuracy: models such as NVIDIA’s Canary-Qwen-2.5B, IBM’s Granite-Speech-3.3-8B, and Microsoft’s Phi-4-Multimodal-Instruct achieve some of the lowest word error rates (WER). This boost comes from the LLM’s ability to apply contextual reasoning and language modeling to refine transcriptions. NVIDIA’s Fast Conformer, a roughly 2x faster variant of the standard Conformer, powers its Canary and Parakeet series and offers a strong speed-accuracy balance.

A clear speed-accuracy tradeoff exists. While LLM decoders deliver top accuracy, they are significantly slower. Efficiency is measured with the inverse real-time factor (RTFx), where higher values indicate faster inference. For real-time or high-throughput applications such as meeting transcription, CTC and TDT decoders are far more efficient, offering 10 to 100 times faster processing at the cost of slightly higher WER; they are well suited to batch processing and live transcription.

Multilingual performance remains a key challenge.
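The RTFx metric can be illustrated with a minimal sketch. This assumes the common definition of inverse real-time factor as audio duration divided by processing time; the leaderboard's exact measurement harness may differ, and the model timings below are hypothetical:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per
    second of compute. Higher values mean faster inference."""
    return audio_seconds / processing_seconds

# Hypothetical timings for 10 minutes (600 s) of audio:
fast_ctc = rtfx(600.0, 2.0)    # a CTC decoder finishing in 2 s -> RTFx 300.0
llm_decoder = rtfx(600.0, 60.0)  # an LLM decoder needing 60 s -> RTFx 10.0

# The illustrative 10-100x efficiency gap from the text:
speedup = fast_ctc / llm_decoder  # 30.0
```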
OpenAI’s Whisper Large v3 remains a strong baseline, supporting 99 languages. However, fine-tuned or distilled variants such as Distil-Whisper and CrisperWhisper often outperform the original on English-only tasks, demonstrating that specialization through targeted training can improve performance. Conversely, English-focused models tend to sacrifice multilingual coverage. Self-supervised models such as Meta’s MMS and Omnilingual ASR support over 1,000 languages but lag behind language-specific encoders in accuracy. The leaderboard currently covers five languages and plans to expand, welcoming community contributions via GitHub.

Long-form transcription presents a different set of challenges. Closed-source models still lead in this domain, likely owing to domain-specific tuning, advanced chunking strategies, and production-level optimizations. Among open models, Whisper Large v3 performs best, but CTC-based Conformers such as NVIDIA’s Parakeet CTC 1.1B achieve remarkable throughput, with an RTFx of 2793.75 versus 68.56 for Whisper, at only a modest increase in WER (6.68 vs. 6.43). Since Parakeet is English-only, this also illustrates the tradeoff between multilingual support and performance.

The Open ASR Leaderboard continues to grow as a community-driven resource, fostering transparency, model sharing, and benchmarking. It serves as a model for other language-specific leaderboards, such as those for Arabic and Russian, which are also advancing through community collaboration. The future of ASR lies in balancing accuracy, speed, multilingualism, and scalability, areas where open-source innovation can make a major impact. The team behind the leaderboard encourages contributions through GitHub, inviting researchers and developers to help shape the next generation of speech recognition.
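To make the reported throughput gap concrete, the following sketch converts the published RTFx figures into wall-clock transcription time for one hour of audio. This is simple arithmetic on the cited numbers, not an independent benchmark:

```python
def transcription_time(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed to transcribe audio at a given RTFx."""
    return audio_seconds / rtfx

HOUR = 3600.0

# Figures reported for the long-form track:
parakeet_secs = transcription_time(HOUR, 2793.75)  # roughly 1.3 s
whisper_secs = transcription_time(HOUR, 68.56)     # roughly 52.5 s
```

At these rates, Parakeet CTC 1.1B clears an hour of audio about 40x faster than Whisper Large v3, at the cost of English-only coverage and a slightly higher WER.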
