NVIDIA’s Granary Dataset Boosts Multilingual Speech AI
By Lena Harper | SAN FRANCISCO – 2025/08/17 14:46:19
NVIDIA is addressing the limited support for the world’s 7,000 languages in AI language models with a new dataset and models. These resources aim to advance high-quality speech recognition and translation AI for 25 European languages, including those with scarce data such as Croatian, Estonian, and Maltese.
These advancements are designed to help developers scale AI applications, providing global users with rapid and precise speech technology. Potential applications include multilingual chatbots, customer service voice agents, and near-real-time translation services. The offerings include:
- Granary: A large, open-source collection of multilingual speech datasets, featuring approximately one million hours of audio, with nearly 650,000 hours for speech recognition and over 350,000 hours for speech translation.
- NVIDIA Canary-1b-v2: A one-billion-parameter model, trained on Granary, for transcribing European languages and translating between English and two dozen supported languages. It tops Hugging Face’s leaderboard for multilingual speech recognition accuracy among open models.
- NVIDIA Parakeet-tdt-0.6b-v3: A 600-million-parameter model, optimized for real-time or high-volume transcription of languages supported by Granary. It achieves the highest throughput among multilingual models on the Hugging Face leaderboard, measured as the duration of audio transcribed divided by the computation time required.
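The throughput measure mentioned for Parakeet is a simple ratio: seconds of audio transcribed per second of computation. The sketch below illustrates that arithmetic only. The function name `throughput_factor` and the sample numbers are hypothetical and are not taken from the leaderboard or from NVIDIA's tooling.

```python
def throughput_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Return seconds of audio transcribed per second of computation.

    Illustrative helper (an assumption, not an NVIDIA or Hugging Face API).
    """
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


# Hypothetical figures: one hour of audio transcribed in six seconds.
example = throughput_factor(audio_seconds=3600.0, compute_seconds=6.0)
print(example)  # 600.0
```

A higher value means more audio processed in the same amount of compute time, which is why the metric matters for real-time and high-volume transcription.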
The research paper on Granary will be presented at Interspeech, held in the Netherlands from August 17 to 21. The dataset, along with the Canary and Parakeet models, is available on Hugging Face.
How Granary Overcomes Data Limitations
“Granary provides a critical resource to develop more inclusive speech technologies that better reflect the linguistic diversity of the continent.”
The NVIDIA speech AI team collaborated with researchers from Carnegie Mellon University and Fondazione Bruno Kessler to develop the Granary dataset. They processed unlabeled audio using the NVIDIA NeMo Speech Data Processor toolkit, transforming it into structured, high-quality data.
This process allowed the team to convert public speech data into a format suitable for AI training, reducing the need for extensive human annotation. The processing pipeline is available as open source on GitHub.
Granary offers clean, ready-to-use data, enabling developers to build models for transcription and translation in nearly all of the European Union’s 24 official languages, as well as Russian and Ukrainian.
For European languages with limited human-annotated datasets, Granary serves as a key resource for creating more inclusive speech technologies that accurately represent the continent’s linguistic diversity, while requiring less training data.
The team’s Interspeech paper demonstrates that Granary requires approximately half the training data of other datasets to achieve a target accuracy level for automatic speech recognition (ASR) and automatic speech translation (AST).
