On-Policy Distillation Boosts Compact ASR Models with Limited Data

other · 2026-05-28

Researchers have introduced Ark-ASR, a 0.6-billion-parameter audio-conditioned language model trained on 100,000 hours of speech using on-policy distillation from a stronger Qwen-ASR teacher. Ark-ASR consistently outperforms supervised fine-tuning alone and beats the Qwen3-ASR-0.6B baseline on four of five Mandarin and English ASR benchmarks. Its 100,000 hours of training speech compare with the 20 million hours used for the Qwen3-Omni AuT encoder, roughly 200 times less audio. The larger Qwen3-ASR-1.7B model remains stronger, but the findings indicate that teacher-guided on-policy training can substantially improve compact ASR models on a much smaller audio budget.
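To make "on-policy distillation" concrete, here is a minimal sketch of the general idea: the student samples its own transcript, and the teacher scores each step of that sample. The per-token reverse KL objective, the toy four-token vocabulary, and the names toy_student, toy_teacher, and on_policy_distill_loss are all assumptions for illustration. The summary does not state Ark-ASR's actual loss, and audio conditioning is omitted here.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at a single token position."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def on_policy_distill_loss(student_fn, teacher_fn, max_len, eos_id, rng):
    """Sample a sequence from the student and score it against the teacher.

    The prefix is built from the student's own samples (on-policy), so the
    teacher's feedback lands on states the student actually visits.
    Assumed objective: mean per-token reverse KL.
    """
    prefix, step_losses = [], []
    for _ in range(max_len):
        s_probs = softmax(student_fn(prefix))
        t_probs = softmax(teacher_fn(prefix))
        step_losses.append(reverse_kl(s_probs, t_probs))
        # The next token comes from the student, not from a reference transcript.
        token = rng.choices(range(len(s_probs)), weights=s_probs)[0]
        prefix.append(token)
        if token == eos_id:
            break
    return sum(step_losses) / len(step_losses), prefix


VOCAB = 4  # toy vocabulary; token 3 stands in for end-of-sequence


def toy_teacher(prefix):
    # Hypothetical teacher: confidently prefers token index len(prefix).
    target = min(len(prefix), VOCAB - 1)
    return [4.0 if i == target else 0.0 for i in range(VOCAB)]


def toy_student(prefix):
    # Hypothetical student: same preference, but less confident.
    target = min(len(prefix), VOCAB - 1)
    return [1.0 if i == target else 0.0 for i in range(VOCAB)]


loss, sampled = on_policy_distill_loss(
    toy_student, toy_teacher, max_len=8, eos_id=3, rng=random.Random(0)
)
print(f"mean reverse KL: {loss:.4f}, sampled tokens: {sampled}")
```

In a real training loop this loss would be backpropagated through the student only. Supervised fine-tuning differs in that the prefix comes from reference transcripts rather than from the student's own samples.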

Key facts

Ark-ASR is a 0.6B-parameter audio-conditioned language model.
Trained with 100k hours of speech.
Uses on-policy distillation from a Qwen-ASR teacher.
Outperforms Qwen3-ASR-0.6B baseline on four of five evaluation sets.
Training data is 100k hours, compared with 20M hours for the Qwen3-Omni AuT encoder.
Qwen3-ASR-1.7B remains stronger.
Method narrows the gap for compact ASR models with a limited audio budget.
Evaluated on Mandarin and English ASR benchmarks.

Sources

arXiv cs.AI — 2026-05-28