Vividh-ASR Benchmark Exposes Studio-Bias in Multilingual ASR
Multilingual ASR models such as Whisper exhibit a phenomenon termed studio-bias: fine-tuning on low-resource languages improves recognition of read speech while degrading performance on spontaneous audio. To address this, researchers developed Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam spanning four tiers: studio, broadcast, spontaneous, and synthetic noise. A controlled study of learning-rate scheduling and curriculum ordering showed that concentrating large parameter updates early in training improves global Word Error Rate (WER) by 12 absolute points, with a hard-to-easy curriculum yielding further gains on spontaneous speech. These findings motivated reverse multi-stage fine-tuning (R-MFT), which allows a 244M-parameter Whisper model to match or surpass conventionally fine-tuned 769M-parameter models. CKA and SVD analyses indicated that effective training schedules concentrate adaptation in the decoder while preserving the pre-trained encoder's acoustic representations. The benchmark and methodology aim to make ASR for Indic languages more robust in real-world spontaneous settings.
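The headline metric throughout is word error rate. As a reference for how the reported "absolute points" are computed, here is a minimal WER sketch based on word-level edit distance; this is an illustrative helper, not the evaluation code used in the study:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table over word sequences: d[i][j] = edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

An "absolute point" improvement means the WER value itself drops, e.g. from 0.42 to 0.30, rather than a relative percentage reduction.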
Key facts
- Studio-bias degrades spontaneous audio performance in fine-tuned multilingual ASR models.
- Vividh-ASR is a complexity-stratified benchmark for Hindi and Malayalam across four tiers.
- Concentrating large parameter updates early in training improves (lowers) global WER by 12 absolute points.
- Hard-to-easy curriculum adds gains for spontaneous speech.
- Reverse multi-stage fine-tuning (R-MFT) enables a 244M Whisper model to match 769M counterparts.
- CKA and SVD analysis shows adaptation concentrates in the decoder.
- The study focuses on low-resource Indic languages.
- The benchmark includes a synthetic-noise tier.
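The CKA finding above (adaptation concentrating in the decoder while encoder representations stay stable) can be probed with linear CKA between layer activations before and after fine-tuning. A minimal sketch, assuming activation matrices of shape (samples, features); this is illustrative, not the authors' analysis code:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices (n_samples x n_features).
    Values near 1 mean the representations are essentially unchanged
    (as reported for the encoder); low values indicate heavy adaptation
    (as reported for the decoder)."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / (norm_x * norm_y))
```

Linear CKA is invariant to isotropic scaling and orthogonal rotation of either representation, which makes it a reasonable probe for "did this layer's representation change" across fine-tuning checkpoints.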