Early Stopping Rollout Improves On-Policy Distillation Efficiency

ai-technology · 2026-05-27

A new machine learning technique called Early Stopping Rollout (ESR) addresses the "Off-policy Teacher Decay" problem in on-policy distillation. In on-policy distillation, a student model is trained on its own rollouts, which a teacher model scores. The teacher's corrective ability decays for later tokens, because the student-generated prefix it conditions on becomes increasingly off-policy for the teacher. ESR restricts rollout generation to only the first tokens of each response. It outperforms full rollout across model sizes, model families, tasks, and training regimes, while improving GPU efficiency and training stability, especially in cross-model-family scenarios. The paper is available on arXiv (ID 2605.27028).

Key facts

On-policy distillation uses teacher-scored student rollouts for training.
The Off-policy Teacher Decay problem reduces the teacher's effectiveness on later tokens.
Early Stopping Rollout (ESR) limits each rollout to the first tokens of the response.
ESR outperforms full rollout across model sizes, model families, tasks, and training regimes.
ESR exhibits higher GPU efficiency and training stability than full rollout.
Improvements are especially notable in cross-model-family scenarios.
The paper is available on arXiv (ID: 2605.27028).
The technique is simple yet effective.
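
The idea in the facts above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the summary does not state the loss function or the token cutoff, so the per-token reverse KL loss, the function names (`sample_rollout`, `reverse_kl`, `distill_loss`), and the toy fixed-distribution "models" are all assumptions made for illustration. The only ESR-specific part is the small `max_new_tokens` budget passed to the rollout.

```python
import math
import random


def sample_rollout(student_step, prompt, max_new_tokens, eos_id, rng):
    """Sample tokens from the student, stopping at EOS or the token budget.

    A small max_new_tokens is the early-stopping part (hypothetical cutoff).
    """
    tokens = []
    for _ in range(max_new_tokens):
        probs = student_step(prompt + tokens)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(token)
        if token == eos_id:
            break
    return tokens


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one position (assumed loss, not from the paper)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def distill_loss(student_step, teacher_step, prompt, rollout):
    """Mean per-token loss: the teacher scores every position of the student rollout."""
    total = 0.0
    for t in range(len(rollout)):
        context = prompt + rollout[:t]
        total += reverse_kl(student_step(context), teacher_step(context))
    return total / max(len(rollout), 1)


# Toy stand-ins for real models: fixed next-token distributions over 3 tokens.
def toy_student(context):
    return [0.7, 0.2, 0.1]


def toy_teacher(context):
    return [0.5, 0.3, 0.2]


rng = random.Random(0)
prompt = [0]
short_rollout = sample_rollout(toy_student, prompt, max_new_tokens=4, eos_id=2, rng=rng)
loss = distill_loss(toy_student, toy_teacher, prompt, short_rollout)
```

In a real setup the two step functions would be forward passes of the student and teacher language models. The efficiency claim follows from the budget: fewer generated tokens means fewer student decoding steps and fewer teacher-scored positions per example.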