Mega-ASR: Scaling Real-World Acoustic Simulation for Robust Speech Recognition

ai-technology · 2026-05-20

Researchers propose Mega-ASR, a unified framework for automatic speech recognition (ASR) in the wild. It targets the acoustic robustness bottleneck, where models fail under severe, compositional distortions. The system combines scalable compound-data construction with progressive acoustic-to-semantic optimization. A new dataset, Voices-in-the-Wild-2M, covers 7 classic acoustic phenomena and 54 physically plausible compound scenarios. Training uses two stages named in the paper: Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. On adverse-condition benchmarks, Mega-ASR reports an error rate of 45.69% versus 54.01% for the prior state of the art on VOiCES R4-B-F, and 21.49% versus 29.34% on NOIZEUS S (lower is better).
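The summary does not say how Voices-in-the-Wild-2M was constructed. As a rough illustration of what a "compound scenario" could mean, the sketch below chains two classic distortions, reverberation and additive noise, on one utterance. Everything here is an assumption made for illustration: the function names, the synthetic impulse response, and the default parameters are hypothetical and are not taken from the Mega-ASR pipeline.

```python
import numpy as np


def add_reverb(speech, sample_rate, rt60=0.4, rng=None):
    """Convolve speech with a synthetic impulse response (hypothetical helper).

    The impulse response is white noise with an exponential envelope that
    falls by 60 dB over `rt60` seconds, a common toy model of a room.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n_taps = max(1, int(rt60 * sample_rate))
    t = np.arange(n_taps) / sample_rate
    # 60 dB amplitude decay over rt60 seconds: 10 ** (-3 * t / rt60)
    impulse = rng.standard_normal(n_taps) * 10.0 ** (-3.0 * t / rt60)
    impulse[0] = 1.0  # keep the direct path
    # Truncate the reverberant tail so the output matches the input length.
    return np.convolve(speech, impulse)[: len(speech)]


def add_noise(speech, noise, snr_db):
    """Mix noise into speech at a target signal-to-noise ratio in dB."""
    noise = np.resize(noise, len(speech))  # repeat or trim noise to fit
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise


def compound_distortion(speech, noise, sample_rate, rt60=0.4, snr_db=5.0):
    """Apply reverberation first, then noise, as one compound scenario."""
    reverberant = add_reverb(speech, sample_rate, rt60=rt60)
    return add_noise(reverberant, noise, snr_db=snr_db)
```

The order matters in this toy version: noise is scaled against the reverberant signal, so the target SNR holds after reverberation rather than before it. A real simulator would likely use measured impulse responses and recorded noise.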

Key facts

Mega-ASR is a unified ASR-in-the-wild framework
Targets the acoustic robustness bottleneck, where models fail under severe, compositional distortions
Voices-in-the-Wild-2M dataset covers 7 classic acoustic phenomena and 54 physically plausible compound scenarios
Uses Acoustic-to-Semantic Progressive Supervised Fine-Tuning
Uses Dual-Granularity WER-Gated Policy Optimization
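Reports an error rate of 45.69% vs. 54.01% for the prior state of the art on VOiCES R4-B-F (lower is better)
Reports an error rate of 21.49% vs. 29.34% for the prior state of the art on NOIZEUS S (lower is better)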
Outperforms prior state-of-the-art on adverse-condition ASR benchmarks
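
To put the benchmark figures in context, the sketch below shows the standard word error rate (WER) calculation and the relative reduction implied by the reported numbers. It assumes the percentages are WER, which the summary does not state explicitly. The function names are illustrative and the code is not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, system):
    """Fraction of the baseline error removed by the new system."""
    return (baseline - system) / baseline


# Figures from the summary, assumed here to be WER percentages.
voices_gain = relative_reduction(54.01, 45.69)
noizeus_gain = relative_reduction(29.34, 21.49)
```

Under that assumption, the absolute gaps are 8.32 and 7.85 percentage points, which correspond to relative reductions of roughly 15.4% on VOiCES R4-B-F and 26.8% on NOIZEUS S.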

Entities

—

Sources

arXiv cs.AI — 2026-05-20