ARTFEED — Contemporary Art Intelligence

On-Policy Distillation for LLMs: Pitfalls and Fixes

other · 2026-05-13

A new arXiv study investigates on-policy distillation (OPD) and on-policy self-distillation (OPSD) as post-training methods for large language models. The authors find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, and that OPSD fails when no instance-specific privileged information is available at test time; OPSD does work when the privileged information encodes a shared latent rule, such as a system prompt. The paper identifies three failure mechanisms, including distribution mismatch between teacher and student, and offers a comprehensive empirical analysis of when these methods succeed or fail.

Key facts

  • On-policy distillation (OPD) and on-policy self-distillation (OPSD) are post-training methods for LLMs.
  • OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation.
  • OPSD fails in tested settings due to absence of instance-specific privileged information at test time.
  • OPSD is effective when privileged information represents a shared latent rule.
  • Three failure mechanisms are identified, including distribution mismatch.
  • The study is published on arXiv with ID 2605.11182.
  • OP(S)D provides dense token-level supervision on trajectories sampled from the student's own policy.
  • Existing results on OP(S)D effectiveness are mixed.
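To make the "dense token-level supervision on the student's own trajectories" concrete, here is a minimal sketch of a typical OPD objective: the student samples a sequence from its own policy, and the teacher scores every token via a per-position reverse KL between the two next-token distributions. This is an illustrative toy in pure Python, not the paper's implementation; all names are assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a single vocabulary slice.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def opd_loss(student_logits, teacher_logits):
    """Mean per-token reverse KL, KL(student || teacher), over a trajectory.

    Both logit sequences are evaluated on the SAME token sequence, which was
    sampled from the student's own policy -- the "on-policy" part.
    Each element is a list of vocabulary logits for one position.
    """
    total = 0.0
    for s_pos, t_pos in zip(student_logits, teacher_logits):
        p = softmax(s_pos)  # student next-token distribution
        q = softmax(t_pos)  # teacher next-token distribution
        total += sum(pi * (math.log(pi) - math.log(qi))
                     for pi, qi in zip(p, q))
    return total / len(student_logits)

# Identical student and teacher give zero loss; a shifted teacher does not.
same = opd_loss([[1.0, 2.0, 0.5]], [[1.0, 2.0, 0.5]])
diff = opd_loss([[1.0, 2.0, 0.5]], [[2.0, 0.5, 1.0]])
print(same, diff > 0)
```

The reverse (mode-seeking) direction of the KL is one common choice for on-policy distillation; the study's sensitivity to "loss formulation" refers to exactly such choices.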

Entities

Institutions

  • arXiv

Sources