Trust-Region Behavior Blending Improves On-Policy Distillation

other · 2026-06-01

Researchers have introduced Trust-Region Behavior Blending (TRB), a warmup technique for on-policy distillation (OPD) that targets the poor quality of early student rollouts. OPD trains a student on prefixes sampled from its own policy while matching a stronger teacher, but early student rollouts can be poor, so supervision lands on weak prefixes. TRB addresses this by replacing the early rollout policy with the behavior policy closest to the teacher inside a student-centered KL trust region, while leaving the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average result.
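
The summary does not state the exact objective, so the following is only one plausible formalization, given as a sketch. It assumes the trust region is measured as KL from the behavior policy to the student, and that closeness to the teacher is measured as KL from the behavior policy to the teacher. The symbols \pi_S (student), \pi_T (teacher), \beta_\epsilon (behavior policy), \epsilon (KL budget), and \lambda (blend weight) are notation introduced here, not taken from the source.

```latex
% Assumed per-prefix trust-region problem (x is a prefix):
\beta_\epsilon(\cdot \mid x)
  = \arg\min_{\beta}\; \mathrm{KL}\!\left(\beta \,\|\, \pi_T(\cdot \mid x)\right)
  \quad \text{s.t.} \quad
  \mathrm{KL}\!\left(\beta \,\|\, \pi_S(\cdot \mid x)\right) \le \epsilon

% Under these assumed KL directions, the solution is a geometric blend:
\beta_\epsilon(y \mid x) \;\propto\; \pi_S(y \mid x)^{1-\lambda}\, \pi_T(y \mid x)^{\lambda},
  \qquad \lambda \in [0, 1]

% The loss is the usual per-prefix reverse KL; only the prefix sampler changes:
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim \beta_\epsilon}
    \left[ \mathrm{KL}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_T(\cdot \mid x)\right) \right]
```

In this reading, \lambda is the largest value for which the blend stays within the budget, and \epsilon = 0 forces \lambda = 0, which recovers pure student rollouts.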

Key facts

TRB is a warmup method for on-policy distillation (OPD).
OPD trains a student on prefixes from its own policy while matching a teacher.
Early student rollouts in OPD can be poor, placing supervision on weak prefixes.
TRB replaces the early rollout policy with the closest-to-teacher behavior policy.
Replacement occurs inside a student-centered KL trust region.
The per-prefix reverse-KL OPD loss remains unchanged.
The KL budget is annealed to zero, returning to pure student rollouts after warmup.
TRB attains the strongest average across two math-reasoning distillation settings.
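
The facts above can be illustrated with a small NumPy sketch for a single next-token distribution. It is not the authors' implementation. It assumes the trust region is KL(behavior || student) and that the behavior policy is a geometric blend of student and teacher, with the blend weight found by bisection. The function names and the linear annealing schedule are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for two strictly positive categorical distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def geometric_blend(student, teacher, lam):
    """Normalized student^(1 - lam) * teacher^lam, computed in log space."""
    logits = (1.0 - lam) * np.log(student) + lam * np.log(teacher)
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def blended_behavior(student, teacher, budget, iters=50):
    """Most teacher-like blend with KL(blend || student) <= budget.

    Assumption: KL(blend || student) grows with lam, so bisection on lam
    finds the largest admissible blend weight.
    """
    if budget <= 0.0:
        return student.copy()   # zero budget: pure student rollouts
    if kl(teacher, student) <= budget:
        return teacher.copy()   # teacher already lies inside the region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_blend(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return geometric_blend(student, teacher, lo)

def linear_budget(step, warmup_steps, initial_budget):
    """Hypothetical schedule: anneal the KL budget linearly to zero."""
    return initial_budget * max(0.0, 1.0 - step / warmup_steps)

student = np.array([0.7, 0.2, 0.1])
teacher = np.array([0.1, 0.2, 0.7])
for step in (0, 5, 10):
    eps = linear_budget(step, warmup_steps=10, initial_budget=0.2)
    print(step, eps, blended_behavior(student, teacher, eps))
```

In a full training loop the blend would be applied per token while sampling rollouts, and the reverse-KL loss on the resulting prefixes would be left as in standard OPD.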

Entities

—

Sources

arXiv cs.AI — 2026-06-01