Position-Weighted Self-Distillation Improves Reasoning Model Reliability
A new arXiv paper (2605.21606) proposes a way to improve on-policy self-distillation (OPSD) for reasoning tasks. Standard OPSD weights all generated tokens equally, but teacher entropy is an ambiguous signal: it can reflect either genuine uncertainty or a diversity of valid solutions. The authors propose a branch-viability diagnostic that tests next-token alternatives from a privileged teacher prompt. In experiments with Qwen3-4B, they find that an oriented within-sequence position score reliably indicates which teacher targets can be trusted. Weighting the distillation signal by this position score improves student performance by trusting teacher targets selectively.
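To make the entropy ambiguity concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a branch-viability check can be approximated as "the fraction of candidate next tokens whose continuation reaches a correct answer". The names `branch_viability`, `continue_fn`, and `is_correct`, and the lookup tables standing in for model rollouts, are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def branch_viability(alternatives, continue_fn, is_correct):
    """Fraction of candidate next tokens whose continuation ends correctly.

    Assumption: continue_fn stands in for a rollout from the chosen token,
    and is_correct stands in for an answer checker.
    """
    outcomes = [is_correct(continue_fn(tok)) for tok in alternatives]
    return sum(outcomes) / len(outcomes)

# Two toy positions share the same teacher distribution over two candidates,
# so their entropy is identical.
teacher_probs = [0.5, 0.5]
h = entropy(teacher_probs)

# Hypothetical rollout results: candidate token -> final answer.
diverse = {"factor": "x=3", "expand": "x=3"}    # both branches are valid
uncertain = {"factor": "x=3", "guess": "x=7"}   # one branch fails

def correct(answer):
    return answer == "x=3"

v_diverse = branch_viability(list(diverse), diverse.get, correct)
v_uncertain = branch_viability(list(uncertain), uncertain.get, correct)
```

In this toy setup both positions have the same entropy, yet only the second one contains a branch that leads to a wrong answer. That is the kind of distinction entropy alone cannot make.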
Key facts
- Paper ID: arXiv:2605.21606
- Focuses on on-policy self-distillation (OPSD) for reasoning
- Standard OPSD treats all generated tokens equally
- Teacher entropy can indicate uncertainty or solution diversity
- Introduces a branch-viability diagnostic to identify reliable tokens
- Uses Qwen3-4B model for experiments
- Key finding: an oriented within-sequence position score indicates which teacher targets are reliable
- Method improves student model reasoning reliability
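The contrast between uniform and position-weighted distillation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' method: the summary does not give the form or direction of the position score, so the linear weight and the `orientation` parameter are hypothetical, as are all function names. The per-token loss is assumed to be KL(teacher || student).

```python
import math

def position_weights(seq_len, orientation=1.0):
    """Linear weights in [0, 1] over token positions (assumed form).

    orientation > 0 up-weights later tokens; otherwise earlier tokens.
    """
    if seq_len == 1:
        return [1.0]
    weights = []
    for t in range(seq_len):
        rel = t / (seq_len - 1)  # relative position in [0, 1]
        weights.append(rel if orientation > 0 else 1.0 - rel)
    return weights

def kl_divergence(teacher, student):
    """KL(teacher || student); assumes student probabilities are nonzero."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)

def weighted_distillation_loss(teacher_dists, student_dists, orientation=1.0):
    """Position-weighted mean of per-token KL terms."""
    weights = position_weights(len(teacher_dists), orientation)
    per_token = [kl_divergence(p, q)
                 for p, q in zip(teacher_dists, student_dists)]
    return sum(w * k for w, k in zip(weights, per_token)) / sum(weights)

# Toy sequence of three tokens over a two-word vocabulary.
# The student matches the teacher except at the last position.
teacher = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
student = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]

loss_late = weighted_distillation_loss(teacher, student, orientation=1.0)
loss_early = weighted_distillation_loss(teacher, student, orientation=-1.0)
```

Because the only mismatch sits at the final token, the late-oriented weighting penalizes it while the early-oriented weighting ignores it. Uniform OPSD corresponds to setting every weight to 1.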
Entities
Institutions
- arXiv