NVIDIA’s Granary Dataset Boosts Multilingual Speech AI

By Lena Harper | SAN FRANCISCO – 2025/08/17 14:46:19

Few of the world’s roughly 7,000 languages are well supported by AI language models, a gap NVIDIA is addressing with a new dataset and two models. These resources aim to advance high-quality speech recognition and translation AI for 25 European languages, including those with scarce data such as Croatian, Estonian, and Maltese.

These advancements are designed to help developers scale AI applications, providing global users with rapid and precise speech technology. Potential applications include multilingual chatbots, customer service voice agents, and near-real-time translation services. The offerings include:

Granary: A large, open-source collection of multilingual speech datasets, featuring approximately one million hours of audio, with nearly 650,000 hours for speech recognition and over 350,000 hours for speech translation.
NVIDIA Canary-1b-v2: A one-billion-parameter model, trained on Granary, for transcribing European languages and translating between English and two dozen supported languages. It tops Hugging Face’s leaderboard for multilingual speech recognition accuracy among open models.
NVIDIA Parakeet-tdt-0.6b-v3: A 600-million-parameter model, optimized for real-time or high-volume transcription of languages supported by Granary. It achieves the highest throughput among multilingual models on the Hugging Face leaderboard, measured as the duration of audio transcribed divided by the computation time.
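
The throughput figure cited for Parakeet is a ratio: seconds of audio transcribed per second of computation. The sketch below shows how such a ratio could be computed over a batch of transcription runs. The `Run` class and `throughput` function are illustrative names chosen for this example, not part of any NVIDIA or Hugging Face API, and the leaderboard's exact measurement procedure may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Run:
    """One transcription run (hypothetical record for illustration)."""
    audio_seconds: float    # length of the audio that was transcribed
    compute_seconds: float  # wall-clock time the model spent on it


def throughput(runs: Iterable[Run]) -> float:
    """Return total audio duration divided by total computation time.

    A value of 100.0 would mean 100 seconds of audio are transcribed
    per second of computation.
    """
    runs = list(runs)
    total_audio = sum(r.audio_seconds for r in runs)
    total_compute = sum(r.compute_seconds for r in runs)
    if total_compute <= 0:
        raise ValueError("total computation time must be positive")
    return total_audio / total_compute
```

Summing durations before dividing, rather than averaging per-run ratios, keeps short clips from dominating the result.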

The research paper on Granary will be presented at Interspeech in the Netherlands from August 17-21. The dataset, along with the Canary and Parakeet models, is available on Hugging Face.

How Granary Overcomes Data Limitations

“Granary provides a critical resource to develop more inclusive speech technologies that better reflect the linguistic diversity of the continent.”

The NVIDIA speech AI team collaborated with researchers from Carnegie Mellon University and Fondazione Bruno Kessler to develop the Granary dataset. They processed unlabeled audio using the NVIDIA NeMo Speech Data Processor toolkit, transforming it into structured, high-quality data.

This process allowed the team to convert public speech data into a format suitable for AI training, reducing the need for extensive human annotation. The processing pipeline is available as open source on GitHub.
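
The article does not spell out the pipeline's individual steps, so the following is only a rough sketch of the general idea of turning machine-transcribed audio into structured training records. Everything in it is an assumption made for illustration: the `Segment` fields, the `is_usable` quality gate and its thresholds, and the JSON-lines output layout. It does not use or reproduce the NeMo Speech Data Processor API.

```python
import json
from dataclasses import dataclass, asdict
from typing import Iterable, List


@dataclass
class Segment:
    """One audio clip with a machine-generated transcript (hypothetical schema)."""
    audio_filepath: str  # path to the audio clip
    duration: float      # clip length in seconds
    text: str            # machine-generated transcript (pseudo-label)
    lang: str            # language code, e.g. "hr" for Croatian


def is_usable(seg: Segment, min_s: float = 1.0, max_s: float = 40.0,
              max_chars_per_s: float = 30.0) -> bool:
    """Hypothetical quality gate with assumed thresholds.

    Drops clips with empty text, out-of-range duration, or an
    implausibly high number of characters per second of audio.
    """
    text = seg.text.strip()
    if not text:
        return False
    if not (min_s <= seg.duration <= max_s):
        return False
    return len(text) / seg.duration <= max_chars_per_s


def to_manifest(segments: Iterable[Segment]) -> str:
    """Serialize usable segments as JSON lines, one training example per line."""
    kept: List[str] = []
    for seg in segments:
        if is_usable(seg):
            row = asdict(seg)
            row["text"] = row["text"].strip()
            kept.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(kept)
```

Passing a list of `Segment` objects to `to_manifest` returns a string with one JSON object per surviving clip, which could then be written to a file for training.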

Granary offers clean, ready-to-use data, enabling developers to build models for transcription and translation in nearly all 24 official languages of the European Union, as well as Russian and Ukrainian.

For European languages with limited human-annotated datasets, Granary serves as a key resource for creating more inclusive speech technologies that accurately represent the continent’s linguistic diversity, while requiring less training data.

The team’s Interspeech paper demonstrates that, compared with other datasets, about half as much Granary training data is needed to reach a target accuracy level for automatic speech recognition (ASR) and automatic speech translation (AST).

Speech AI Explained

Speech AI, also known as speech recognition or automatic speech recognition (ASR), is the technology that enables machines to understand and transcribe human speech into text. It relies on machine learning models trained on vast amounts of audio data. (Sources: IBM Cloud Learn Hub, TechTarget)

Key Milestones:

1950s: Early speech recognition systems emerge, capable of recognizing single words.
1990s: Hidden Markov models (HMMs) revolutionize speech recognition accuracy.
2010s: Deep learning techniques, particularly neural networks, significantly improve ASR performance.
2020s: AI models handle multiple languages and dialects.

Long-Term Trend: The global speech and voice recognition market is projected to reach $42.1 billion by 2029, growing at a CAGR of
