On-Policy Distillation Boosts Compact ASR Models with Limited Data
Researchers have introduced Ark-ASR, a 0.6-billion-parameter audio-conditioned language model trained with on-policy distillation from a stronger Qwen-ASR teacher. Ark-ASR consistently outperforms supervised fine-tuning alone and beats the Qwen3-ASR-0.6B baseline on four of five Mandarin and English ASR benchmarks. It is trained on only 100,000 hours of speech, whereas the Qwen3-Omni AuT encoder relies on 20 million hours, roughly 200 times as much audio. The larger Qwen3-ASR-1.7B model remains stronger, but the findings indicate that teacher-guided on-policy training can substantially improve compact ASR models on a much smaller audio budget.
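To make the idea of on-policy distillation concrete, here is a minimal toy sketch in Python. It is not Ark-ASR's actual training code: the summary does not specify the loss, so the per-token reverse KL divergence used here is an assumption, and the names `on_policy_distill_loss`, `toy_student`, and `toy_teacher` are hypothetical. Audio conditioning is omitted; in a real system both models would also receive the audio features. The point it illustrates is that the student generates the sequence itself and the teacher scores the prefixes the student actually visited, rather than the student imitating fixed reference transcripts.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def on_policy_distill_loss(student_logits_fn, teacher_logits_fn, max_len, rng):
    """Sample a sequence from the student, then average the per-token
    reverse KL against the teacher over the prefixes the student visited.

    Assumption: reverse KL is the distillation objective; the source
    does not state which divergence Ark-ASR uses.
    """
    prefix = []
    total = 0.0
    for _ in range(max_len):
        s = softmax(student_logits_fn(prefix))
        t = softmax(teacher_logits_fn(prefix))
        total += reverse_kl(s, t)
        # On-policy step: the next token comes from the student, not the teacher.
        token = rng.choices(range(len(s)), weights=s, k=1)[0]
        prefix.append(token)
    return total / max_len, prefix


# Hypothetical stand-ins for real models: three-token vocabulary,
# logits that ignore the prefix.
def toy_student(prefix):
    return [0.0, 0.0, 0.0]


def toy_teacher(prefix):
    return [2.0, 0.0, -1.0]


if __name__ == "__main__":
    loss, tokens = on_policy_distill_loss(
        toy_student, toy_teacher, max_len=4, rng=random.Random(0)
    )
    print("sampled tokens:", tokens)
    print("mean reverse KL:", loss)
```

In an actual training loop this loss would be differentiated with respect to the student's parameters; the sketch only computes its value to show where the student's own samples enter the objective.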
Key facts
- Ark-ASR is a 0.6B-parameter audio-conditioned language model.
- Trained on 100k hours of speech.
- Uses on-policy distillation from a Qwen-ASR teacher.
- Outperforms Qwen3-ASR-0.6B baseline on four of five evaluation sets.
- By comparison, the Qwen3-Omni AuT encoder relies on 20M hours of audio.
- Qwen3-ASR-1.7B remains stronger.
- The method narrows the gap for compact ASR models trained on a limited audio budget.
- Evaluated on Mandarin and English ASR benchmarks.