vOPD: Stabilizing On-Policy Distillation for LLMs with Control Variate Baseline
Researchers have introduced vOPD, short for On-Policy Distillation with a control variate baseline, a technique that addresses the instability of on-policy distillation (OPD) in large language models. OPD is a dominant post-training paradigm, especially for reasoning, but it suffers from high gradient variance because it relies on a single-sample Monte Carlo estimator. vOPD stabilizes training by casting OPD as policy-gradient reinforcement learning and introducing a value function as a control variate baseline. The value function has a closed form, the per-token negative reverse KL divergence between student and teacher, and is available directly from the already-computed forward pass, so no additional critic network or inference is needed.
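In policy-gradient terms, the construction can be sketched as follows; the notation (student policy $\pi_\theta$, teacher $\pi_T$, token $y_t$ sampled from the student in context $s_t$) is assumed here for illustration and may differ from the paper's:

$$\nabla_\theta J = \mathbb{E}_{y_t \sim \pi_\theta}\left[\big(r_t - V(s_t)\big)\,\nabla_\theta \log \pi_\theta(y_t \mid s_t)\right], \qquad r_t = -\log\frac{\pi_\theta(y_t \mid s_t)}{\pi_T(y_t \mid s_t)},$$

$$V(s_t) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid s_t)}\left[-\log\frac{\pi_\theta(y \mid s_t)}{\pi_T(y \mid s_t)}\right] = -\,\mathrm{KL}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_T(\cdot \mid s_t)\big).$$

Because $V(s_t)$ does not depend on which token was sampled, subtracting it leaves the estimator unbiased while reducing its variance, and because it is an expectation under the student's own distribution it can be evaluated exactly from the logits the forward pass already produced.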
Key facts
- vOPD stands for On-Policy Distillation with a control variate baseline
- OPD is a dominant post-training paradigm for large language models, especially for reasoning
- OPD is unstable due to high gradient variance of its single-sample Monte Carlo estimator
- vOPD casts OPD as policy-gradient RL
- vOPD introduces a control variate baseline (value function) from RL literature
- The value function has a closed form: the per-token negative reverse KL divergence between student and teacher
- The closed form is available directly from the already-computed forward pass
- No additional critic network or inference pass is needed (a code sketch follows this list)
- Existing methods compute the full token-level reverse KL over the entire vocabulary, or restrict it to a top-k support, adding overhead
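Below is a minimal sketch of how such a baselined update could look, assuming PyTorch and that both student and teacher log-probabilities over the full vocabulary are available from the forward pass; the function and tensor names are hypothetical, and the code illustrates the facts above rather than reproducing the paper's implementation:

```python
import torch
import torch.nn.functional as F

def vopd_loss(student_logits, teacher_logits, sampled_tokens):
    """student_logits, teacher_logits: [batch, seq, vocab] from the forward pass.
    sampled_tokens: [batch, seq] tokens drawn from the student policy.
    (Padding masks and sequence-level returns are omitted for brevity.)"""
    log_p_s = F.log_softmax(student_logits, dim=-1)  # student log-probs
    log_p_t = F.log_softmax(teacher_logits, dim=-1)  # teacher log-probs

    # Single-sample Monte Carlo reward: negative log-ratio at the sampled
    # token, an unbiased but high-variance estimate of -KL(student || teacher).
    lp_s = log_p_s.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    lp_t = log_p_t.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    reward = (lp_t - lp_s).detach()

    # Control variate baseline in closed form: the exact per-token negative
    # reverse KL, computed from log-probs the forward pass already produced.
    baseline = -(log_p_s.exp() * (log_p_s - log_p_t)).sum(-1).detach()

    # REINFORCE-style loss on the variance-reduced advantage.
    advantage = reward - baseline
    return -(advantage * lp_s).mean()
```

Note that the baseline costs only a reduction over quantities that already exist in memory; no separate critic network or extra inference pass is involved.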
- The paper is available on arXiv with ID 2605.07865