ARTFEED — Contemporary Art Intelligence

OPSD Compresses Rather Than Corrects in Long Reasoning Traces

publication · 2026-05-09

A recent arXiv paper (2605.06188) examines On-Policy Self-Distillation (OPSD) as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning models. Using token-level credit assignment from a self-teacher, OPSD improves accuracy and shortens responses in short-output settings, but these gains do not carry over to thinking-enabled mathematical reasoning, where the accuracy improvements shrink and can even turn negative. The authors argue that hindsight supervision supplies better token alternatives in short outputs but mainly flags redundancy in extended traces. When tested separately on correct and incorrect rollouts, OPSD behaves more like a compression mechanism than a correction mechanism for long reasoning traces.
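To make "token-level credit assignment from a self-teacher" concrete: self-distillation schemes of this kind typically score each token by how much the student's on-policy distribution diverges from a teacher distribution produced by the same model under extra (hindsight) conditioning. The sketch below is a generic illustration of that idea with toy distributions, not the paper's actual objective; all names and numbers are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def token_credits(student_dists, teacher_dists):
    """Per-token KL(teacher || student): large values mark tokens where the
    hindsight-conditioned self-teacher disagrees with the on-policy student,
    i.e. where a distillation loss would concentrate its credit."""
    return [kl(t, s) for s, t in zip(student_dists, teacher_dists)]

# Toy 3-token vocabulary over a 2-step trace (illustrative numbers only).
student = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
teacher = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]  # teacher diverges at step 2
credits = token_credits(student, teacher)
# Step 1 gets ~zero credit (distributions agree); step 2 gets a large credit.
```

On this reading, a long reasoning trace in which the teacher mostly agrees with the student (near-zero credits almost everywhere) gives the loss little to correct, which is consistent with the paper's claim that OPSD ends up compressing rather than correcting.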

Key facts

  • OPSD is an alternative to RLVR for reasoning models.
  • OPSD uses token-level credit assignment from a self-teacher.
  • OPSD promises higher accuracy and shorter responses.
  • In thinking-enabled math reasoning, OPSD accuracy gains shrink or turn negative.
  • Hindsight supervision supplies better alternatives in short outputs.
  • In long traces, hindsight supervision identifies redundancy more than replacements.
  • OPSD was tested separately on correct and incorrect rollout groups.
  • OPSD behaves as a compression mechanism in long reasoning traces.

Entities

Institutions

  • arXiv

Sources