ARTFEED — Contemporary Art Intelligence

Trust-Region Behavior Blending Improves On-Policy Distillation

other · 2026-06-01

Researchers have introduced Trust-Region Behavior Blending (TRB), a warmup technique for on-policy distillation (OPD) aimed at improving the quality of early student rollouts. OPD trains a student on prefixes sampled from its own policy while matching a stronger teacher, but early student rollouts can be poor, so supervision lands on weak prefixes. During warmup, TRB replaces the rollout policy with the behavior policy closest to the teacher inside a student-centered KL trust region, while leaving the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. TRB attains the strongest average across two math-reasoning distillation settings.
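
One plausible way to write the idea down is sketched below. The notation is assumed for illustration and is not taken from the source: h is a prefix, \pi_\theta the student, \pi_T the teacher, b_\varepsilon the blended behavior policy, and \varepsilon the KL budget. The direction of each KL term inside the trust-region problem is also an assumption.

```latex
\begin{aligned}
b_\varepsilon(\cdot \mid h)
  &= \arg\min_{b}\; \mathrm{KL}\big(b \,\|\, \pi_T(\cdot \mid h)\big)
  \quad \text{s.t.} \quad
  \mathrm{KL}\big(b \,\|\, \pi_\theta(\cdot \mid h)\big) \le \varepsilon, \\
\mathcal{L}(\theta)
  &= \mathbb{E}_{h \sim b_\varepsilon}
  \Big[ \mathrm{KL}\big(\pi_\theta(\cdot \mid h) \,\|\, \pi_T(\cdot \mid h)\big) \Big].
\end{aligned}
```

Under this reading, setting \varepsilon = 0 forces b_\varepsilon = \pi_\theta, which recovers standard OPD with pure student rollouts.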

Key facts

  • TRB is a warmup method for on-policy distillation (OPD).
  • OPD trains a student on prefixes from its own policy while matching a teacher.
  • Early student rollouts in OPD can be poor, placing supervision on weak prefixes.
  • TRB replaces the early rollout policy with the closest-to-teacher behavior policy.
  • Replacement occurs inside a student-centered KL trust region.
  • The per-prefix reverse-KL OPD loss remains unchanged.
  • The KL budget is annealed to zero, returning to pure student rollouts after warmup.
  • TRB attains the strongest average across two math-reasoning distillation settings.
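
The facts above can be illustrated with a minimal sketch for a single next-token distribution. Everything in it is an assumption for illustration rather than the authors' implementation: the function names are hypothetical, the trust region is taken to be KL(behavior || student) <= budget, "closest to teacher" is taken to mean minimal KL(behavior || teacher), and the budget schedule is linear. Under those assumptions the constrained optimum lies on the geometric path student^(1-lam) * teacher^lam, so a bisection on lam is enough. All probabilities are assumed strictly positive.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def geometric_mix(student, teacher, lam):
    """Normalized geometric interpolation student^(1-lam) * teacher^lam."""
    logits = [(1 - lam) * math.log(s) + lam * math.log(t)
              for s, t in zip(student, teacher)]
    top = max(logits)  # subtract the max before exponentiating for stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def blended_behavior(student, teacher, kl_budget, iters=60):
    """Behavior policy nearest the teacher with KL(behavior || student) <= kl_budget.

    Hypothetical helper: the KL directions are assumptions, not from the source.
    """
    if kl_budget <= 0:
        return list(student)  # zero budget: pure student rollouts
    if kl(teacher, student) <= kl_budget:
        return list(teacher)  # teacher already lies inside the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        # KL(mix || student) grows monotonically with lam, so bisection applies.
        mid = 0.5 * (lo + hi)
        if kl(geometric_mix(student, teacher, mid), student) <= kl_budget:
            lo = mid
        else:
            hi = mid
    return geometric_mix(student, teacher, lo)

def kl_budget_at(step, warmup_steps, initial_budget):
    """Linear anneal of the KL budget to zero (the schedule shape is an assumption)."""
    return initial_budget * max(0.0, 1.0 - step / warmup_steps)

# Toy usage: the behavior policy drifts back to the student as the budget shrinks.
student = [0.7, 0.2, 0.1]
teacher = [0.1, 0.2, 0.7]
for step in (0, 50, 100):
    eps = kl_budget_at(step, warmup_steps=100, initial_budget=0.3)
    behavior = blended_behavior(student, teacher, eps)
    print(step, round(eps, 3), [round(x, 3) for x in behavior])
```

In a full training loop, rollouts would be sampled token by token from the blended policy while the reverse-KL loss is still computed between student and teacher on the resulting prefixes; the sketch covers only the blending and annealing steps.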
