ARTFEED — Contemporary Art Intelligence

Lookahead Group Reward Mitigates Supervision Fidelity Decay in On-Policy Distillation

ai-technology · 2026-06-01

Researchers identify Supervision Fidelity Decay (SFD) as a critical bottleneck in on-policy distillation: teacher confidence drops as student-generated prefixes lengthen, which weakens the teacher's corrective signal. They propose Lookahead Group Reward (LGR), which scores candidate student tokens by the future teacher confidence they induce, improving long-chain reasoning.

Key facts

  • arXiv:2605.30833
  • On-policy distillation uses token-level teacher feedback on student-generated trajectories.
  • Supervision Fidelity Decay (SFD) is the decline in teacher confidence as student-generated prefixes grow longer.
  • SFD causes student drift in long reasoning chains.
  • Lookahead Group Reward (LGR) evaluates top-K candidate tokens by induced teacher confidence.
  • LGR assigns group-normalized rewards.
  • LGR is designed for computational efficiency.
  • The paper introduces an entropy-based mechanism for efficiency.
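
The two LGR facts above (scoring top-K candidates by induced teacher confidence, then assigning group-normalized rewards) can be illustrated with a minimal Python sketch. This is not the paper's implementation. Every name here is hypothetical, the z-score normalization is one plausible reading of "group-normalized", and the entropy gate (skip the lookahead when the student distribution is already peaked) is only a guess at what the entropy-based efficiency mechanism might look like. The `lookahead_confidence` callable stands in for whatever teacher query the method actually performs.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_k_indices(probs, k):
    """Indices of the k most probable tokens, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def group_normalized_rewards(confidences, eps=1e-8):
    """Z-score each candidate's confidence within its group (assumed normalization)."""
    k = len(confidences)
    mean = sum(confidences) / k
    std = math.sqrt(sum((c - mean) ** 2 for c in confidences) / k)
    return [(c - mean) / (std + eps) for c in confidences]

def lookahead_group_reward(student_probs, lookahead_confidence, k=4,
                           entropy_threshold=0.5):
    """Return {token_id: reward} for the student's top-k candidate tokens.

    lookahead_confidence is a hypothetical callable mapping a candidate
    token id to the teacher confidence observed after that token is appended.
    """
    # Assumed efficiency gate: skip the lookahead at low-entropy positions.
    if entropy(student_probs) < entropy_threshold:
        return {}
    candidates = top_k_indices(student_probs, k)
    confidences = [lookahead_confidence(t) for t in candidates]
    return dict(zip(candidates, group_normalized_rewards(confidences)))
```

Under these assumptions, a candidate that leads the teacher to higher confidence than its group average receives a positive reward and one that leads to lower confidence receives a negative reward. Positions where the student is already near-certain return no rewards and cost no extra teacher queries.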

Entities

Institutions

  • arXiv