Lookahead Group Reward Mitigates Supervision Fidelity Decay in On-Policy Distillation

ai-technology · 2026-06-01

Researchers identify Supervision Fidelity Decay (SFD) as a critical bottleneck in on-policy distillation: teacher confidence drops as student-generated prefixes lengthen, which weakens the corrective signal. They propose Lookahead Group Reward (LGR), which scores student tokens by the teacher confidence they induce downstream, improving long-chain reasoning.

Key facts

arXiv:2605.30833
On-policy distillation uses token-level teacher feedback on student-generated trajectories.
Supervision Fidelity Decay (SFD) is the drop in teacher confidence as student-generated prefixes grow longer.
SFD causes student drift in long reasoning chains.
Lookahead Group Reward (LGR) evaluates top-K candidate tokens by induced teacher confidence.
LGR assigns group-normalized rewards.
LGR is designed for computational efficiency.
The paper introduces an entropy-based mechanism for efficiency.
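
The key facts above do not give the paper's formulas, so the following is only a minimal Python sketch of how the pieces could fit together, not the authors' implementation. It assumes that "group-normalized" means centering and scaling the scores within the top-K group, and that the entropy-based mechanism is a gate that skips the lookahead when the student distribution is already confident. The names lookahead_group_reward, teacher_confidence_after, k, and entropy_threshold are hypothetical and chosen for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def group_normalized_rewards(scores, eps=1e-8):
    """Center and scale scores within one group (zero mean, roughly unit variance)."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return [(s - mean) / (std + eps) for s in scores]


def lookahead_group_reward(student_probs, teacher_confidence_after,
                           k=4, entropy_threshold=0.5):
    """Score the student's top-k candidate tokens at one position.

    student_probs: dict mapping token -> student probability.
    teacher_confidence_after: hypothetical callable, token -> teacher
        confidence measured after that token is appended to the prefix.
    Returns a dict token -> reward, or {} when the assumed entropy gate
    decides the lookahead is not worth its cost.
    """
    # Assumed efficiency gate: skip positions where the student is confident.
    if entropy(student_probs.values()) < entropy_threshold:
        return {}
    # Take the k most probable candidate tokens under the student.
    top_k = sorted(student_probs, key=student_probs.get, reverse=True)[:k]
    # Look ahead: how confident is the teacher once each candidate is taken?
    induced = [teacher_confidence_after(tok) for tok in top_k]
    # Reward each candidate relative to the other candidates in its group.
    return dict(zip(top_k, group_normalized_rewards(induced)))


# Toy usage with made-up numbers (not results from the paper).
student_probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
toy_confidence = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.1}
rewards = lookahead_group_reward(student_probs, toy_confidence.get, k=3)
skipped = lookahead_group_reward({"a": 0.99, "b": 0.01}, toy_confidence.get)
```

In this toy run, token "b" receives the largest reward because it leads to the highest teacher confidence among the three candidates, even though the student ranks "a" first. The second call returns an empty dict because the near-deterministic student distribution falls below the assumed entropy threshold.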
