Lookahead Group Reward Mitigates Supervision Fidelity Decay in On-Policy Distillation
Researchers identify Supervision Fidelity Decay (SFD) as a critical bottleneck in on-policy distillation: teacher confidence drops as student-generated prefixes lengthen, which weakens the corrective signal. They propose Lookahead Group Reward (LGR), which scores student tokens by the teacher confidence that follows them, to improve long-chain reasoning.
Key facts
- arXiv:2605.30833
- On-policy distillation uses token-level teacher feedback on student-generated trajectories.
- Under Supervision Fidelity Decay (SFD), teacher confidence falls as student-generated prefixes grow longer.
- SFD causes student drift in long reasoning chains.
- Lookahead Group Reward (LGR) evaluates the top-K candidate tokens by the future teacher confidence each one induces.
- LGR assigns group-normalized rewards.
- LGR is designed for computational efficiency; the paper introduces an entropy-based mechanism for this purpose.
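The LGR facts above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `lookahead_confidence` callable, the toy numbers, and the z-score form of group normalization are all assumptions made for illustration. The summary also does not say how the entropy-based mechanism works, so the entropy gate below (skip lookahead where the student distribution is already peaked) is only one plausible reading.

```python
import math
from statistics import mean, pstdev


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def group_normalized_rewards(scores, eps=1e-8):
    """Standardize scores within one candidate group (assumed z-score form)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def lookahead_group_reward(student_probs, lookahead_confidence,
                           k=4, entropy_threshold=0.5):
    """Hypothetical LGR step for a single position.

    student_probs: dict mapping token -> student probability at this position.
    lookahead_confidence: callable token -> float, a stand-in for the teacher
        confidence observed after the token is appended (assumption).
    Returns a dict token -> reward, or None if the assumed entropy gate
    skips this position.
    """
    # Assumed efficiency gate: only spend lookahead where the student is uncertain.
    if entropy(student_probs.values()) < entropy_threshold:
        return None
    # Keep the K most probable candidates under the student.
    top_k = sorted(student_probs, key=student_probs.get, reverse=True)[:k]
    # Score each candidate by the teacher confidence it induces.
    scores = [lookahead_confidence(tok) for tok in top_k]
    return dict(zip(top_k, group_normalized_rewards(scores)))


# Toy usage with made-up numbers (not results from the paper).
student_probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
toy_confidence = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.3}
rewards = lookahead_group_reward(student_probs, toy_confidence.get, k=3)
skipped = lookahead_group_reward({"a": 0.99, "b": 0.01}, toy_confidence.get)
```

In this toy run, the candidate with the highest induced confidence receives the largest reward and the rewards within the group sum to roughly zero, so the signal is relative to the other candidates rather than absolute. The second call returns `None` because the student distribution is nearly deterministic and the assumed gate skips it.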
Entities
Institutions
- arXiv