On-Policy Distillation for LLMs: Pitfalls and Fixes
A new arXiv preprint investigates on-policy distillation (OPD) and on-policy self-distillation (OPSD) for large language models. The study finds that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, while OPSD fails in the tested settings because the instance-specific privileged information it relies on is unavailable at test time. OPSD does work when the privileged information encodes a shared latent rule, such as a system prompt. Three failure mechanisms are identified, among them distribution mismatch between teacher and student. Overall, the paper offers a comprehensive empirical analysis of when these methods succeed or fail.
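To ground the terminology, a common on-policy distillation objective (a generic sketch; the paper's exact loss may differ, and the study finds results are sensitive to this choice) minimizes a per-token divergence between student $\pi_\theta$ and teacher $\pi_{\text{teacher}}$ on trajectories sampled from the student itself:

$$\mathcal{L}_{\text{OPD}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_\theta(\cdot\mid x)}\left[\sum_{t} D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot\mid x,y_{<t})\;\middle\|\;\pi_{\text{teacher}}(\cdot\mid x,y_{<t})\right)\right]$$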
Key facts
- On-policy distillation (OPD) and on-policy self-distillation (OPSD) are post-training methods for LLMs.
- OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation.
- OPSD fails in the tested settings because instance-specific privileged information is unavailable at test time.
- OPSD is effective when the privileged information represents a shared latent rule, such as a system prompt (see the second sketch after this list).
- Three failure mechanisms are identified, including distribution mismatch between teacher and student.
- The study is published on arXiv with ID 2605.11182.
- OP(S)D provides dense token-level supervision on trajectories sampled from the model's own policy (see the first sketch after this list).
- Existing results on OP(S)D effectiveness are mixed.
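To illustrate what that dense token-level supervision looks like in practice, here is a minimal PyTorch sketch of a generic on-policy distillation loss. This is not the paper's implementation: the function name, tensor shapes, and the choice of reverse KL are assumptions made for illustration.

```python
# Minimal sketch of dense token-level supervision in on-policy distillation.
# All names and shapes are illustrative; the paper's exact loss may differ.
import torch
import torch.nn.functional as F

def opd_token_kl(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL(student || teacher), averaged over the sequence.

    Both tensors have shape (seq_len, vocab_size) and are assumed to be
    logits evaluated on the SAME trajectory, sampled from the student's
    own policy (the 'on-policy' part).
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL divergence summed over the vocabulary at each token position.
    kl_per_token = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)
    return kl_per_token.mean()

# Toy usage: random logits stand in for real model outputs.
torch.manual_seed(0)
seq_len, vocab = 16, 100
student = torch.randn(seq_len, vocab, requires_grad=True)
teacher = torch.randn(seq_len, vocab)
loss = opd_token_kl(student, teacher)
loss.backward()  # gradients flow only into the student
print(f"token-level KL loss: {loss.item():.4f}")
```

The direction of the KL (reverse vs. forward) is one instance of the "loss formulation" the study finds OPD to be sensitive to.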
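And to illustrate the OPSD setting, here is a sketch of how privileged information enters training. Everything here is hypothetical: the prompt strings and helper names are invented, and the paper's actual setup may differ. The point is that a shared latent rule (a fixed system prompt) can be distilled into the student's weights, whereas instance-specific privileged information cannot be reproduced at test time.

```python
# Hypothetical OPSD setup: the "teacher" is the same model conditioned on
# privileged information, while the student sees only the bare input.
PRIVILEGED_RULE = "Always answer in exactly three numbered steps."  # invented example of a shared latent rule

def teacher_input(question: str) -> str:
    # Privileged context, available only during training.
    return f"[SYSTEM] {PRIVILEGED_RULE}\n[USER] {question}"

def student_input(question: str) -> str:
    # The only view available at test time: instance-specific privileged
    # information (e.g., a per-problem hint) cannot be supplied here, but a
    # fixed rule like the one above can be absorbed into the weights.
    return f"[USER] {question}"

question = "What is 12 * 7?"
print(teacher_input(question))
print(student_input(question))
```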