
Granary European Speech Recognition and Translation Dataset


Granary is a large-scale multilingual speech dataset released by NVIDIA's multi-site research team in 2025. The accompanying paper is "Granary: Speech Recognition and Translation Dataset in 25 European Languages". The dataset aims to provide high-quality training and evaluation material for multilingual automatic speech recognition (ASR) and automatic speech translation (AST) models.

The dataset contains approximately 1 million hours of high-quality pseudo-labeled ASR speech data covering 25 European languages: 23 EU languages plus Ukrainian and Russian. The data is sourced from publicly available speech corpora and processed through a unified pseudo-labeling and quality-filtering pipeline.

Languages include:

Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish, Ukrainian and Russian.
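
For scripts that need to iterate over the covered languages, the list above can be written as a lookup table. The following is a minimal sketch, not part of the dataset's tooling: the names `GRANARY_LANGUAGES` and `NON_EU_CODES` are hypothetical, and the keys are standard ISO 639-1 codes, which are assumed here for illustration and may differ from the identifiers the dataset itself uses.

```python
# Hypothetical lookup table: ISO 639-1 code -> language name, in the order
# the languages are listed above. The dataset's own identifiers may differ.
GRANARY_LANGUAGES = {
    "bg": "Bulgarian", "cs": "Czech", "da": "Danish", "de": "German",
    "el": "Greek", "en": "English", "es": "Spanish", "et": "Estonian",
    "fi": "Finnish", "fr": "French", "hr": "Croatian", "hu": "Hungarian",
    "it": "Italian", "lt": "Lithuanian", "lv": "Latvian", "mt": "Maltese",
    "nl": "Dutch", "pl": "Polish", "pt": "Portuguese", "ro": "Romanian",
    "sk": "Slovak", "sl": "Slovenian", "sv": "Swedish",
    "uk": "Ukrainian", "ru": "Russian",
}

# The two languages the description names as outside the EU group.
NON_EU_CODES = {"uk", "ru"}

# Codes of the EU languages, derived by excluding the two above.
eu_codes = [code for code in GRANARY_LANGUAGES if code not in NON_EU_CODES]

if __name__ == "__main__":
    print(len(GRANARY_LANGUAGES))  # 25 languages in total
    print(len(eu_codes))           # 23 EU languages
```

The counts match the description: 25 languages in total, of which 23 are EU languages.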