Vividh-ASR Benchmark Exposes Studio-Bias in Multilingual ASR
Multilingual ASR models such as Whisper exhibit a phenomenon termed studio-bias: fine-tuning on low-resource languages improves recognition of read speech while degrading performance on spontaneous audio. To address this, researchers developed Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam spanning four tiers: studio, broadcast, spontaneous, and synthetic noise. A controlled study of learning-rate scheduling and curriculum ordering showed that concentrating large parameter updates early in training improves global Word Error Rate (WER) by 12 absolute points, with a hard-to-easy curriculum yielding further gains on spontaneous speech. These findings motivated reverse multi-stage fine-tuning (R-MFT), which allows a 244M-parameter Whisper model to match or surpass conventionally fine-tuned 769M-parameter models. CKA and SVD analyses indicated that effective training schedules concentrate adaptation in the decoder while preserving the pre-trained encoder's acoustic representations. The benchmark and methodology aim to make ASR for Indic languages more robust in real-world spontaneous settings.
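The headline metric throughout is word error rate. As a reference for how the reported "absolute points" are computed, here is a minimal WER sketch based on word-level edit distance; this is an illustrative helper, not the evaluation code used in the study:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table over word sequences: d[i][j] = edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

An "absolute point" improvement means the WER value itself drops, e.g. from 0.42 to 0.30, rather than a relative percentage reduction.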
Key facts
- Studio-bias degrades spontaneous audio performance in fine-tuned multilingual ASR models.
- Vividh-ASR is a complexity-stratified benchmark for Hindi and Malayalam across four tiers.
- Concentrating large parameter updates early in training improves (lowers) global WER by 12 absolute points.
- Hard-to-easy curriculum adds gains for spontaneous speech.
- Reverse multi-stage fine-tuning (R-MFT) enables a 244M Whisper model to match 769M counterparts.
- CKA and SVD analysis shows adaptation concentrates in the decoder.
- The study focuses on low-resource Indic languages.
- The benchmark includes a synthetic-noise tier.
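The CKA finding above (adaptation concentrating in the decoder while encoder representations stay stable) can be probed with linear CKA between layer activations before and after fine-tuning. A minimal sketch, assuming activation matrices of shape (samples, features); this is illustrative, not the authors' analysis code:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices (n_samples x n_features).
    Values near 1 mean the representations are essentially unchanged
    (as reported for the encoder); low values indicate heavy adaptation
    (as reported for the decoder)."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / (norm_x * norm_y))
```

Linear CKA is invariant to isotropic scaling and orthogonal rotation of either representation, which makes it a reasonable probe for "did this layer's representation change" across fine-tuning checkpoints.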