Position-Weighted Self-Distillation Improves Reasoning Model Reliability

other · 2026-05-23

A new arXiv paper (2605.21606) proposes a refinement to on-policy self-distillation (OPSD) for reasoning tasks. Standard OPSD weights every generated token equally, yet teacher entropy is an ambiguous signal: it can reflect genuine uncertainty or simply a diversity of valid solutions. The authors introduce a branch-viability diagnostic that tests next-token alternatives from a privileged teacher prompt. In experiments with Qwen3-4B, they find that an oriented within-sequence position score indicates which teacher targets are reliable. Weighting the distillation signal by this score lets the student trust teacher targets selectively, which improves its performance.

Key facts

Paper ID: arXiv:2605.21606
Focuses on on-policy self-distillation (OPSD) for reasoning
Standard OPSD treats all generated tokens equally
Teacher entropy can indicate uncertainty or solution diversity
Introduces branch-viability diagnostic to identify reliable tokens
Uses Qwen3-4B model for experiments
Key finding: an oriented within-sequence position score indicates token reliability
Method improves student model reasoning reliability
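
The idea of trusting teacher targets selectively by position can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not give the exact form or direction of the position score, so the linear score, the `orientation` and `floor` parameters, and the function names below are all hypothetical. The sketch only shows how a per-token weight that depends on position within the sequence could replace the uniform weighting of standard OPSD.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def position_weights(seq_len, orientation=1.0, floor=0.2):
    """Hypothetical linear position score, normalized to mean 1.

    orientation >= 0 weights later tokens more; orientation < 0 weights
    earlier tokens more. `floor` keeps every token's weight above zero.
    Both are assumptions made for illustration only.
    """
    if seq_len == 1:
        return [1.0]
    raw = []
    for t in range(seq_len):
        rel = t / (seq_len - 1)  # relative position in [0, 1]
        score = rel if orientation >= 0 else 1.0 - rel
        raw.append(floor + (1.0 - floor) * score)
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def weighted_distill_loss(teacher_logits, student_logits, weights):
    """Weighted average of per-token KL(teacher || student)."""
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(w * l for w, l in zip(weights, per_token)) / sum(weights)


# Toy example: 3 generated tokens over a 3-word vocabulary.
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0], [0.0, 3.0, 0.0]]
student = [[0.0, 2.0, -1.0], [0.4, 0.6, 0.0], [0.0, 2.5, 0.5]]

uniform = weighted_distill_loss(teacher, student, [1.0, 1.0, 1.0])
weighted = weighted_distill_loss(teacher, student, position_weights(3))
print(f"uniform OPSD-style loss: {uniform:.4f}")
print(f"position-weighted loss:  {weighted:.4f}")
```

Setting `floor=1.0` makes every weight equal to 1, which recovers uniform weighting, so the two regimes can be compared with the same function.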
