Direction-Adaptive Self-Distillation Improves LLM Reasoning

other · 2026-05-23

On-policy self-distillation (OPSD) is an emerging post-training paradigm for large language models (LLMs) in which the model acts as its own teacher: given privileged information such as reference traces or hints, it supplies dense token-level supervision on its own rollouts. Recent work, however, indicates that OPSD can degrade complex reasoning by suppressing predictive uncertainty, which the model needs for exploration and for revising hypotheses. Token-level analysis traces the problem to supervision being applied in the same direction to every token regardless of its uncertainty: conforming to the teacher restricts exploration at high-entropy tokens, while deviating from it hurts accuracy at low-entropy tokens. To address this, the authors introduce Direction-Adaptive Self-Distillation (DASD), which replaces uniform imitation with entropy-routed directional supervision. The paper is available on arXiv under the identifier 2605.22263.

Key facts

On-policy self-distillation (OPSD) is an emerging LLM post-training paradigm.
OPSD uses the model as its own teacher with privileged information.
OPSD degrades complex reasoning by suppressing predictive uncertainty.
Token-level analysis shows uniform teacher supervision causes the failure.
Direction-Adaptive Self-Distillation (DASD) is proposed as a solution.
DASD reframes uniform imitation as entropy-routed directional supervision.
The paper is available on arXiv with ID 2605.22263.
The work targets the uniform direction of teacher supervision across tokens with differing uncertainty.
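
Illustrative sketch

The summary above does not give the DASD objective, so the following is only a minimal sketch of what "entropy-routed" supervision could look like, not the paper's method. Everything specific here is an assumption made for illustration: routing on the student's own predictive entropy, the threshold `tau`, the reverse-KL distillation term, and the placeholder `high_entropy_weight` (set to 0.0 so that high-entropy tokens are simply left unconstrained). The function name `entropy_routed_loss` is hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats; q is clamped to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def entropy_routed_loss(student_logits, teacher_logits, tau=0.5,
                        high_entropy_weight=0.0):
    """Per-token distillation loss with a weight chosen by student entropy.

    ASSUMPTIONS (not taken from the paper):
      - tokens are routed by the student's predictive entropy vs. `tau`
      - low-entropy tokens imitate the teacher (weight 1.0)
      - high-entropy tokens get `high_entropy_weight` (placeholder value)
      - the distillation term is KL(student || teacher)
    Returns (mean weighted loss, list of per-token weights).
    """
    weights, total = [], 0.0
    for s_logit, t_logit in zip(student_logits, teacher_logits):
        p_s, p_t = softmax(s_logit), softmax(t_logit)
        w = 1.0 if entropy(p_s) < tau else high_entropy_weight
        weights.append(w)
        total += w * kl(p_s, p_t)
    return total / len(weights), weights

# Two token positions over a 4-word vocabulary:
# position 0 is confident (low entropy), position 1 is nearly uniform.
student = [[8.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.1, 0.0]]
teacher = [[9.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]]

routed_loss, routed_weights = entropy_routed_loss(student, teacher, tau=0.5)
# tau = infinity routes every token to imitation (uniform supervision)
uniform_loss, uniform_weights = entropy_routed_loss(student, teacher, tau=float("inf"))
print(routed_weights, uniform_weights)
```

Setting `tau` to infinity treats every token as low-entropy, which recovers uniform supervision on all tokens and makes the two regimes easy to compare on the same rollout.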
