ARTFEED — Contemporary Art Intelligence

vOPD: Stabilizing On-Policy Distillation for LLMs with Control Variate Baseline

ai-technology · 2026-05-11

Researchers have introduced vOPD, short for On-Policy Distillation with a control variate baseline, a technique that addresses the training instability of on-policy distillation (OPD) in large language models. OPD has become a dominant post-training paradigm, especially for reasoning, but it is unstable because its single-sample Monte Carlo estimator produces high gradient variance. vOPD casts OPD as policy-gradient reinforcement learning and adds a value function as a control variate baseline. This value function has a closed form, the per-token negative reverse KL divergence between student and teacher, that is available directly from the already-computed forward pass, so no additional critic network or extra inference is needed.
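
As a rough sketch of how the pieces fit together, using notation of our own choosing rather than the paper's: let π_θ be the student, π_T the teacher, s_t the context at step t, and x_t the token sampled from the student.

    % Reverse KL between student and teacher at step t
    \mathrm{KL}_t = \mathbb{E}_{x \sim \pi_\theta(\cdot \mid s_t)}\left[ \log \pi_\theta(x \mid s_t) - \log \pi_T(x \mid s_t) \right]

    % Single-sample Monte Carlo reward read off the sampled token x_t (high variance)
    \hat{r}_t = -\left( \log \pi_\theta(x_t \mid s_t) - \log \pi_T(x_t \mid s_t) \right)

    % Control variate baseline: the closed-form value function, i.e. the
    % per-token negative reverse KL, and the resulting advantage
    V(s_t) = -\mathrm{KL}_t, \qquad A_t = \hat{r}_t - V(s_t)

Since E[r̂_t] = V(s_t) and the baseline does not depend on the sampled token, the advantage A_t is zero-mean, and subtracting the baseline leaves the policy gradient unbiased while reducing its variance, which is the standard control variate argument from the RL literature.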

Key facts

  • vOPD stands for On-Policy Distillation with a control variate baseline
  • OPD is a dominant post-training paradigm for large language models, especially for reasoning
  • OPD is unstable because its single-sample Monte Carlo gradient estimator has high variance
  • vOPD casts OPD as policy-gradient RL
  • vOPD introduces a control variate baseline (value function) from RL literature
  • The value function has a closed form: the per-token negative reverse KL divergence between student and teacher (see the sketch after this list)
  • The closed form is available directly from the already-computed forward pass
  • No additional critic or inference is needed
  • Existing methods compute the full token-level reverse KL over the entire vocabulary, or restrict it to a top-k support, adding overhead
  • The paper is available on arXiv with ID 2605.07865
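
To make the "no additional critic" point concrete, here is a minimal PyTorch sketch of how such a baseline-corrected advantage could be computed from logits the training step already produced. The function name, tensor shapes, and the detach convention are our assumptions, not code from the paper.

    import torch
    import torch.nn.functional as F

    def vopd_advantages(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        sampled_ids: torch.LongTensor) -> torch.Tensor:
        """Hypothetical sketch of per-token advantages for OPD with a
        closed-form control variate baseline.

        student_logits, teacher_logits: [batch, seq, vocab]
        sampled_ids: [batch, seq], tokens sampled from the student
        """
        log_p_s = F.log_softmax(student_logits, dim=-1)  # student log-probs
        log_p_t = F.log_softmax(teacher_logits, dim=-1)  # teacher log-probs

        # Single-sample Monte Carlo reward at each position:
        # r_t = -(log pi_student(x_t) - log pi_teacher(x_t))
        lp_s = log_p_s.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
        lp_t = log_p_t.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
        reward = -(lp_s - lp_t)

        # Closed-form baseline: per-token negative reverse KL, summed over
        # the vocabulary of logits the forward pass has already computed
        rev_kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)
        baseline = -rev_kl

        # The baseline is independent of the sampled token, so subtracting
        # it keeps the policy-gradient estimator unbiased; detach because
        # advantages act as fixed weights in a REINFORCE-style loss
        return (reward - baseline).detach()

In a REINFORCE-style OPD objective these advantages would then weight the student's per-token log-probabilities of its own samples, with no second critic network or extra teacher query involved.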

Entities

Institutions

  • arXiv

Sources

  • arXiv:2605.07865