On-Policy Distillation for LLMs: Pitfalls and Fixes
A new arXiv preprint investigates on-policy distillation (OPD) and on-policy self-distillation (OPSD) for large language models. The study finds that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, while OPSD fails in the tested settings because the instance-specific privileged information it relies on is unavailable at test time. OPSD does work when the privileged information encodes a shared latent rule, such as a system prompt. Three failure mechanisms are identified, among them distribution mismatch between teacher and student. Overall, the paper offers a comprehensive empirical analysis of when these methods succeed or fail.
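To ground the terminology, a common on-policy distillation objective (a generic sketch; the paper's exact loss may differ, and the study finds results are sensitive to this choice) minimizes a per-token divergence between student $\pi_\theta$ and teacher $\pi_{\text{teacher}}$ on trajectories sampled from the student itself:

$$\mathcal{L}_{\text{OPD}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_\theta(\cdot\mid x)}\left[\sum_{t} D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot\mid x,y_{<t})\;\middle\|\;\pi_{\text{teacher}}(\cdot\mid x,y_{<t})\right)\right]$$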
Key facts
- On-policy distillation (OPD) and on-policy self-distillation (OPSD) are post-training methods for LLMs.
- OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation.
- OPSD fails in the tested settings because instance-specific privileged information is unavailable at test time.
- OPSD is effective when the privileged information represents a shared latent rule, such as a system prompt (see the second sketch after this list).
- Three failure mechanisms are identified, including distribution mismatch between teacher and student.
- The study is published on arXiv with ID 2605.11182.
- OP(S)D provides dense token-level supervision on trajectories sampled from the model's own policy (see the first sketch after this list).
- Existing results on OP(S)D effectiveness are mixed.
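To illustrate what that dense token-level supervision looks like in practice, here is a minimal PyTorch sketch of a generic on-policy distillation loss. This is not the paper's implementation: the function name, tensor shapes, and the choice of reverse KL are assumptions made for illustration.

```python
# Minimal sketch of dense token-level supervision in on-policy distillation.
# All names and shapes are illustrative; the paper's exact loss may differ.
import torch
import torch.nn.functional as F

def opd_token_kl(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL(student || teacher), averaged over the sequence.

    Both tensors have shape (seq_len, vocab_size) and are assumed to be
    logits evaluated on the SAME trajectory, sampled from the student's
    own policy (the 'on-policy' part).
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL divergence summed over the vocabulary at each token position.
    kl_per_token = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)
    return kl_per_token.mean()

# Toy usage: random logits stand in for real model outputs.
torch.manual_seed(0)
seq_len, vocab = 16, 100
student = torch.randn(seq_len, vocab, requires_grad=True)
teacher = torch.randn(seq_len, vocab)
loss = opd_token_kl(student, teacher)
loss.backward()  # gradients flow only into the student
print(f"token-level KL loss: {loss.item():.4f}")
```

The direction of the KL (reverse vs. forward) is one instance of the "loss formulation" the study finds OPD to be sensitive to.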
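And to illustrate the OPSD setting, here is a sketch of how privileged information enters training. Everything here is hypothetical: the prompt strings and helper names are invented, and the paper's actual setup may differ. The point is that a shared latent rule (a fixed system prompt) can be distilled into the student's weights, whereas instance-specific privileged information cannot be reproduced at test time.

```python
# Hypothetical OPSD setup: the "teacher" is the same model conditioned on
# privileged information, while the student sees only the bare input.
PRIVILEGED_RULE = "Always answer in exactly three numbered steps."  # invented example of a shared latent rule

def teacher_input(question: str) -> str:
    # Privileged context, available only during training.
    return f"[SYSTEM] {PRIVILEGED_RULE}\n[USER] {question}"

def student_input(question: str) -> str:
    # The only view available at test time: instance-specific privileged
    # information (e.g., a per-problem hint) cannot be supplied here, but a
    # fixed rule like the one above can be absorbed into the weights.
    return f"[USER] {question}"

question = "What is 12 * 7?"
print(teacher_input(question))
print(student_input(question))
```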