IBM’s Granite Speech Model Tops Open ASR Hugging Face Leaderboard

IBM claims its Granite Speech 3.3 8B model took the top spot on Hugging Face’s Open ASR leaderboard, outperforming rivals like OpenAI’s Whisper and Meta’s speech models.

Granite Speech 3.3 8B is a speech recognition model that IBM built on top of its Granite 3.3 8B Instruct base model using modality alignment and LoRA fine-tuning.

It achieved an industry-leading average word error rate (WER) of 5.85%, despite being significantly smaller than competing models.
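For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation, not the leaderboard's actual scoring code; the function name and the simple lowercase-and-split normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count.

    Illustrative only: real evaluations apply more careful text
    normalization (punctuation, numerals) than lowercasing and splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")  # prints 33.33%
```

On this scale, an average WER of 5.85% corresponds to roughly six word errors per hundred reference words.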

"When tested on several different bodies of audio data, the Granite model had the lowest word error rate (or the highest accuracy), beating out several proprietary models as well," IBM said in a blog post.

Granite was trained on a diverse set of public audio datasets representing multiple dialects and speech contexts—from voicemails to earnings calls—making it highly adaptable to real-world use.

IBM augmented the training data by injecting noise and cutting audio segments to make the model more resilient.
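IBM has not detailed the exact procedure here, so the following is only a rough sketch of what these two kinds of augmentation commonly look like. It assumes white Gaussian noise mixed in at a chosen signal-to-noise ratio and a random fixed-length crop; the function names, the noise type, and the parameter values are all hypothetical.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Mix white Gaussian noise into a waveform at the given SNR in dB.

    Assumption: Gaussian noise is used for illustration; the article does
    not say which noise sources IBM used.
    """
    if not samples:
        return []
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def random_crop(samples, crop_len, rng):
    """Return a random contiguous segment of crop_len samples.

    If the waveform is already shorter than crop_len, a copy is returned.
    """
    if crop_len >= len(samples):
        return list(samples)
    start = rng.randint(0, len(samples) - crop_len)
    return samples[start:start + crop_len]


# Usage: one second of a 440 Hz tone at 16 kHz, made noisy and then cropped.
rng = random.Random(0)  # seeded so the example is reproducible
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise(clean, snr_db=10.0, rng=rng)
segment = random_crop(noisy, crop_len=4000, rng=rng)
print(len(noisy), len(segment))  # prints 16000 4000
```

Exposing a model to noisy and truncated inputs during training is a common way to reduce its sensitivity to imperfect recordings at inference time.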

The model’s architecture incorporates state-of-the-art components like conformer-based encoders and window query transformers, helping it handle complex speech scenarios.

IBM credits its strong performance to balanced data sampling and acoustic encoder innovations. With this release, IBM underscores its commitment to building human-level speech AI in the coming decade.

Designed for enterprise use, the model excels at transcribing English speech and translating it into French, Spanish, German, Italian, Portuguese, Japanese, and Mandarin.

The model is fully open source and available on Hugging Face.