
NVIDIA Unveils Granary Dataset and New Speech AI Models to Boost Multilingual Support for Underrepresented European Languages

2 days ago

NVIDIA has unveiled a new open dataset and set of AI models aimed at advancing multilingual speech technology for underrepresented European languages. The initiative, called Granary, addresses a major gap in AI development: of the world's 7,000+ languages, only a small fraction are currently backed by robust language models. Granary provides high-quality, ready-to-use training data for 25 European languages, including Croatian, Estonian, and Maltese—languages often overlooked due to scarce annotated resources. The dataset supports automatic speech recognition (ASR) and automatic speech translation (AST), enabling developers to build scalable, accurate speech AI applications such as multilingual chatbots, customer service voice agents, and real-time translation tools.

The dataset was developed through a collaboration between NVIDIA's speech AI team and researchers from Carnegie Mellon University and Fondazione Bruno Kessler. Using the NVIDIA NeMo Speech Data Processor toolkit, the team ran vast amounts of unlabeled audio through an automated pipeline that transformed raw data into structured, high-quality training material—without relying on expensive human annotation. This open-source pipeline is now available on GitHub, allowing developers to replicate the workflow for other languages or use cases.

The Granary dataset includes data from all 24 official languages of the European Union, plus Russian and Ukrainian, offering a critical foundation for building more inclusive AI systems that reflect Europe's linguistic diversity. In a paper to be presented at Interspeech 2025 in the Netherlands, researchers demonstrated that Granary enables models to reach target accuracy levels with roughly half the training data required by other leading datasets—a major step toward reducing the computational and time costs of training speech models.
NVIDIA also released two new models—Canary-1b-v2 and Parakeet-tdt-0.6b-v3—built using Granary data. Canary-1b-v2, available under a permissive license, supports 25 languages and delivers transcription and translation quality on par with models three times its size, while running inference up to 10 times faster. It also produces accurate punctuation, capitalization, and word-level timestamps.

Parakeet-tdt-0.6b-v3 is optimized for speed and scalability, capable of transcribing a full 24-minute audio segment in a single inference pass. It automatically detects the input language and performs transcription without requiring additional prompts, making it well suited to high-throughput applications.

Both models were developed using NVIDIA NeMo, a modular AI software suite that streamlines the AI lifecycle. NeMo Curator helped filter out low-quality or synthetic audio samples, ensuring only high-fidelity data was used for training, while the NeMo Speech Data Processor toolkit handled critical preprocessing tasks such as audio-transcript alignment and format conversion.

The Granary dataset and both models are now publicly available on Hugging Face, with detailed documentation and code on GitHub. By sharing its tools, data, and methodology, NVIDIA is accelerating innovation in speech AI and enabling developers worldwide to build more accessible, multilingual voice technologies.
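For readers who want to try the released models, the snippet below is a minimal sketch of loading Parakeet-tdt-0.6b-v3 through NeMo's generic ASR interface and transcribing audio files. The Hugging Face model id (`nvidia/parakeet-tdt-0.6b-v3`) and the exact call pattern are assumptions based on standard NeMo conventions, not details from the article; installing `nemo_toolkit` with its ASR extras is required.

```python
def transcribe(paths):
    """Transcribe a list of audio file paths with Parakeet-tdt-0.6b-v3.

    Assumes NeMo is installed (e.g. `pip install "nemo_toolkit[asr]"`).
    The model id below is a hypothetical Hugging Face identifier inferred
    from the model name in the announcement.
    """
    # Imported lazily: NeMo is a heavy dependency and pulls in PyTorch.
    import nemo.collections.asr as nemo_asr

    # Download (or load from cache) the pretrained checkpoint.
    model = nemo_asr.models.ASRModel.from_pretrained(
        "nvidia/parakeet-tdt-0.6b-v3"
    )

    # Per the announcement, the model detects the input language on its
    # own, so no language code or prompt is passed here.
    return model.transcribe(paths)
```

A call such as `transcribe(["meeting.wav"])` would return the transcriptions for the given files; long recordings (up to roughly 24 minutes, per the article) can be handled in a single inference pass.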
