vOPD: Stabilizing On-Policy Distillation for LLMs with Control Variate Baseline
Researchers have introduced vOPD, short for On-Policy Distillation with a control variate baseline, a technique that addresses the instability of on-policy distillation (OPD) in large language models. OPD is a dominant post-training paradigm, especially for reasoning, but it suffers from high gradient variance because it relies on a single-sample Monte Carlo estimator. vOPD stabilizes training by casting OPD as policy-gradient reinforcement learning and introducing a value function as a control variate baseline. The value function has a closed form, the per-token negative reverse KL divergence between student and teacher, and is available directly from the already-computed forward pass, so no additional critic network or inference is needed.
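In policy-gradient terms, the construction can be sketched as follows; the notation (student policy $\pi_\theta$, teacher $\pi_T$, token $y_t$ sampled from the student in context $s_t$) is assumed here for illustration and may differ from the paper's:

$$\nabla_\theta J = \mathbb{E}_{y_t \sim \pi_\theta}\left[\big(r_t - V(s_t)\big)\,\nabla_\theta \log \pi_\theta(y_t \mid s_t)\right], \qquad r_t = -\log\frac{\pi_\theta(y_t \mid s_t)}{\pi_T(y_t \mid s_t)},$$

$$V(s_t) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid s_t)}\left[-\log\frac{\pi_\theta(y \mid s_t)}{\pi_T(y \mid s_t)}\right] = -\,\mathrm{KL}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_T(\cdot \mid s_t)\big).$$

Because $V(s_t)$ does not depend on which token was sampled, subtracting it leaves the estimator unbiased while reducing its variance, and because it is an expectation under the student's own distribution it can be evaluated exactly from the logits the forward pass already produced.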
Key facts
- vOPD stands for On-Policy Distillation with a control variate baseline
- OPD is a dominant post-training paradigm for large language models, especially for reasoning
- OPD is unstable due to high gradient variance of its single-sample Monte Carlo estimator
- vOPD casts OPD as policy-gradient RL
- vOPD introduces a control variate baseline (value function) from RL literature
- The value function has a closed form: the per-token negative reverse KL divergence between student and teacher
- The closed form is available directly from the already-computed forward pass
- No additional critic network or inference pass is needed (a code sketch follows this list)
- Existing methods compute the full token-level reverse KL over the entire vocabulary, or restrict it to a top-k support, adding overhead
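Below is a minimal sketch of how such a baselined update could look, assuming PyTorch and that both student and teacher log-probabilities over the full vocabulary are available from the forward pass; the function and tensor names are hypothetical, and the code illustrates the facts above rather than reproducing the paper's implementation:

```python
import torch
import torch.nn.functional as F

def vopd_loss(student_logits, teacher_logits, sampled_tokens):
    """student_logits, teacher_logits: [batch, seq, vocab] from the forward pass.
    sampled_tokens: [batch, seq] tokens drawn from the student policy.
    (Padding masks and sequence-level returns are omitted for brevity.)"""
    log_p_s = F.log_softmax(student_logits, dim=-1)  # student log-probs
    log_p_t = F.log_softmax(teacher_logits, dim=-1)  # teacher log-probs

    # Single-sample Monte Carlo reward: negative log-ratio at the sampled
    # token, an unbiased but high-variance estimate of -KL(student || teacher).
    lp_s = log_p_s.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    lp_t = log_p_t.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    reward = (lp_t - lp_s).detach()

    # Control variate baseline in closed form: the exact per-token negative
    # reverse KL, computed from log-probs the forward pass already produced.
    baseline = -(log_p_s.exp() * (log_p_s - log_p_t)).sum(-1).detach()

    # REINFORCE-style loss on the variance-reduced advantage.
    advantage = reward - baseline
    return -(advantage * lp_s).mean()
```

Note that the baseline costs only a reduction over quantities that already exist in memory; no separate critic network or extra inference pass is involved.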
- The paper is available on arXiv with ID 2605.07865