Trust-Region Behavior Blending Improves On-Policy Distillation
Researchers have introduced Trust-Region Behavior Blending (TRB), a warmup technique for on-policy distillation (OPD) that improves the quality of early student rollouts. OPD trains a student on prefixes sampled from its own policy while matching a teacher, but early student rollouts can be poor, so supervision lands on weak prefixes. TRB addresses this by replacing the early rollout policy with the behavior policy closest to the teacher inside a student-centered KL trust region, while leaving the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average result.
Key facts
- TRB is a warmup method for on-policy distillation (OPD).
- OPD trains a student on prefixes from its own policy while matching a teacher.
- Early student rollouts in OPD can be poor, placing supervision on weak prefixes.
- TRB replaces the early rollout policy with the behavior policy closest to the teacher.
- Replacement occurs inside a student-centered KL trust region.
- The per-prefix reverse-KL OPD loss remains unchanged.
- The KL budget is annealed to zero, returning to pure student rollouts after warmup.
- TRB attains the strongest average result across two math-reasoning distillation settings.
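
The blending step in the facts above can be illustrated with a minimal sketch for a single next-token distribution. The summary does not give the exact objective, so everything specific here is an assumption made for illustration: "closest to the teacher" is read as minimizing KL(b || teacher), the trust region is read as KL(b || student) <= budget, and the budget is annealed linearly. Under those assumptions the constrained minimizer is a normalized geometric mixture of student and teacher, and its mixing weight can be found by bisection. The names `blended_policy`, `geometric_mix`, and `linear_budget` are hypothetical, and all probabilities are assumed to be strictly positive.

```python
import math


def kl(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def geometric_mix(student, teacher, lam):
    """Normalized student^(1 - lam) * teacher^lam, with lam in [0, 1]."""
    weights = [s ** (1.0 - lam) * t ** lam for s, t in zip(student, teacher)]
    total = sum(weights)
    return [w / total for w in weights]


def blended_policy(student, teacher, budget, iters=60):
    """Assumed reading of the blending step: the distribution closest to the
    teacher (in KL(b || teacher)) subject to KL(b || student) <= budget."""
    if budget <= 0.0:
        return list(student)  # zero budget: pure student rollouts
    if kl(teacher, student) <= budget:
        return list(teacher)  # the teacher itself lies inside the trust region
    lo, hi = 0.0, 1.0
    # KL(mix || student) grows with lam, so bisect for the largest feasible lam.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_mix(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return geometric_mix(student, teacher, lo)


def linear_budget(step, warmup_steps, initial_budget):
    """Hypothetical linear schedule that reaches zero at the end of warmup."""
    return initial_budget * max(0.0, 1.0 - step / warmup_steps)


# Example: the behavior policy drifts back toward the student as the budget shrinks.
student = [0.7, 0.2, 0.1]
teacher = [0.1, 0.3, 0.6]
for step in (0, 50, 100):
    eps = linear_budget(step, warmup_steps=100, initial_budget=0.2)
    print(step, blended_policy(student, teacher, eps))
```

In this sketch only the distribution used to sample rollouts changes. The training loss would still be the per-prefix reverse KL between student and teacher, evaluated on whichever prefixes the rollouts produce.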