ARTFEED — Contemporary Art Intelligence

Mega-ASR: Scaling Real-World Acoustic Simulation for Robust Speech Recognition

ai-technology · 2026-05-20

Researchers propose Mega-ASR, a unified framework for automatic speech recognition (ASR) in the wild. It targets the acoustic robustness bottleneck: models that fail when severe distortions occur in combination. The system pairs scalable compound-data construction with progressive acoustic-to-semantic optimization. A new dataset, Voices-in-the-Wild-2M, covers 7 classic acoustic phenomena and 54 physically plausible compound scenarios. Training uses Acoustic-to-Semantic Progressive Supervised Fine-Tuning followed by Dual-Granularity WER-Gated Policy Optimization. On adverse-condition benchmarks, the reported figures are 45.69% for Mega-ASR against 54.01% for the prior state of the art on VOiCES R4-B-F, and 21.49% against 29.34% on NOIZEUS S. Lower is better, so Mega-ASR comes out ahead on both.
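To make "compound scenario" concrete, here is a minimal sketch of stacking two distortions, reverberation followed by additive noise at a target signal-to-noise ratio. It is an illustration under stated assumptions, not the Mega-ASR data pipeline: the function names, the toy impulse response, the sine-wave "speech" and the 5 dB SNR are all hypothetical, and the summary does not say which phenomena the dataset combines or how.

```python
import math
import random


def convolve(signal, impulse_response):
    """Direct-form convolution; a simple stand-in for room reverberation."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def power(samples):
    """Mean squared amplitude of a sample sequence."""
    return sum(v * v for v in samples) / len(samples)


def add_noise_at_snr(signal, noise, snr_db):
    """Scale the noise so signal power / noise power equals snr_db, then mix."""
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


def compound_distort(signal, impulse_response, noise, snr_db):
    """Hypothetical two-stage compound: reverberate first, then add noise."""
    reverberant = convolve(signal, impulse_response)[: len(signal)]
    return add_noise_at_snr(reverberant, noise, snr_db)


# Toy inputs (assumptions): a 220 Hz tone at 16 kHz, a three-tap echo,
# and seeded Gaussian noise so the run is repeatable.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
rir = [1.0, 0.0, 0.0, 0.5, 0.0, 0.25]
noise = [rng.gauss(0.0, 1.0) for _ in clean]
noisy = compound_distort(clean, rir, noise, snr_db=5.0)
```

The order of the stages matters here: the noise is scaled against the reverberant signal, so the SNR is measured at the simulated microphone rather than at the clean source.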

Key facts

  • Mega-ASR is a unified ASR-in-the-wild framework
  • Targets the acoustic robustness bottleneck: failures under severe, compositional distortions
  • Voices-in-the-Wild-2M dataset covers 7 classic acoustic phenomena and 54 compound scenarios
  • Uses Acoustic-to-Semantic Progressive Supervised Fine-Tuning
  • Uses Dual-Granularity WER-Gated Policy Optimization
  • Scores 45.69% vs. 54.01% for the prior state of the art on VOiCES R4-B-F (lower is better)
  • Scores 21.49% vs. 29.34% for the prior state of the art on NOIZEUS S (lower is better)
  • Outperforms prior state-of-the-art on adverse-condition ASR benchmarks
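
For readers unfamiliar with how ASR benchmark scores are computed, the sketch below shows the standard word error rate (WER): word-level edit distance divided by the number of reference words. Reading the benchmark percentages above as WER is an assumption; the summary does not name the metric. The function names and the example sentences are illustrative only. The second helper expresses the gap between two scores as a relative reduction, which for the figures above works out to roughly 15.4% on VOiCES R4-B-F and 26.8% on NOIZEUS S.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, new):
    """Fraction of the baseline score removed by the new system."""
    return (baseline - new) / baseline


# One dropped word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Figures quoted in the article, assumed here to be error rates in percent.
voices_gain = relative_reduction(54.01, 45.69)
noizeus_gain = relative_reduction(29.34, 21.49)
```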
