Mega-ASR: Scaling Real-World Acoustic Simulation for Robust Speech Recognition
Researchers propose Mega-ASR, a unified framework for automatic speech recognition (ASR) in the wild. It targets the acoustic robustness bottleneck, where models fail under severe, compositional distortions. The system combines scalable compound-data construction with progressive acoustic-to-semantic optimization. A new dataset, Voices-in-the-Wild-2M, covers 7 classic acoustic phenomena and 54 physically plausible compound scenarios. Training uses Acoustic-to-Semantic Progressive Supervised Fine-Tuning followed by Dual-Granularity WER-Gated Policy Optimization. On adverse-condition benchmarks, Mega-ASR reaches 45.69% versus 54.01% for the compared system on VOiCES R4-B-F and 21.49% versus 29.34% on NOIZEUS S (lower is better), outperforming prior state-of-the-art systems.
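To make "compound scenario" concrete, here is a minimal sketch of one possible two-stage distortion: reverberation followed by additive noise at a target signal-to-noise ratio. The summary does not list the seven phenomena or describe the construction pipeline, so the choice of reverb and noise, the ordering, and every name below (`convolve`, `mix_at_snr`, `compound_distort`, the toy impulse response) are assumptions for illustration, not the authors' method.

```python
import math
import random


def convolve(signal, impulse_response):
    """Direct-form convolution, truncated to the input length."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out[n] = acc
    return out


def power(x):
    """Mean squared amplitude of a sample sequence."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so speech power / noise power matches snr_db."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def compound_distort(clean, impulse_response, noise, snr_db):
    """Hypothetical compound scenario: the room acts on the speech first,
    then background noise is added at the microphone."""
    reverberant = convolve(clean, impulse_response)
    return mix_at_snr(reverberant, noise, snr_db)


# Toy inputs (all assumed): a 220 Hz tone at 16 kHz, a three-tap echo, white noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
rir = [1.0, 0.0, 0.0, 0.5, 0.0, 0.25]
noise = [rng.gauss(0.0, 1.0) for _ in clean]
noisy = compound_distort(clean, rir, noise, snr_db=5.0)
```

Applying the reverb before the noise is one way to read "physically plausible": the order of operations follows the signal path from speaker to room to microphone. A real pipeline would likely use measured impulse responses and recorded noise rather than these toy inputs.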
Key facts
- Mega-ASR is a unified ASR-in-the-wild framework
- Targets the acoustic robustness bottleneck: models failing under severe, compositional distortions
- Voices-in-the-Wild-2M dataset covers 7 classic acoustic phenomena and 54 compound scenarios
- Uses Acoustic-to-Semantic Progressive Supervised Fine-Tuning
- Uses Dual-Granularity WER-Gated Policy Optimization
- Reaches 45.69% versus 54.01% for the compared system on VOiCES R4-B-F (lower is better)
- Reaches 21.49% versus 29.34% for the compared system on NOIZEUS S (lower is better)
- Outperforms prior state-of-the-art systems on adverse-condition ASR benchmarks
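The benchmark figures above are easier to compare as absolute and relative reductions. The sketch below assumes the percentages are word error rates (WER); the summary does not name the metric, although the method name refers to WER and the lower figure is the better one. It also assumes the second number in each pair is the baseline. `word_error_rate` and `relative_reduction` are illustrative helpers, not code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction of the baseline error removed by the system."""
    return (baseline - system) / baseline


# Figures from the summary, ordered as (assumed baseline, Mega-ASR), in percent.
reported = {
    "VOiCES R4-B-F": (54.01, 45.69),
    "NOIZEUS S": (29.34, 21.49),
}

for name, (baseline, system) in reported.items():
    absolute = baseline - system
    relative = 100 * relative_reduction(baseline, system)
    print(f"{name}: {absolute:.2f} points absolute, {relative:.1f}% relative")
```

Under those assumptions the gaps are 8.32 points (about 15.4% relative) on VOiCES R4-B-F and 7.85 points (about 26.8% relative) on NOIZEUS S.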