EGRSD: Entropy-Guided Reinforced Self-Distillation for Efficient LLM Reasoning
EGRSD (Entropy-Guided Reinforced Self-Distillation) improves on-policy self-distillation for training reasoning models. Existing methods weight the teacher's token-level signals uniformly across a chain-of-thought sequence, ignoring variation in the teacher's predictive entropy. EGRSD instead composes each token-level update from three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and a teacher-entropy confidence gate that down-weights high-entropy positions while keeping a nonzero lower bound. A causal-lookahead variant, CL-EGRSD, further distinguishes sustained high-entropy spans from transient ones. Experiments reported in the paper validate the method.
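The paper's exact update rule is not given here, but a minimal sketch of how the three signals might combine into per-token weights could look like the following. The exponential gate form, the clamped floor, and all tensor and parameter names (`gate_floor`, `entropy_scale`) are illustrative assumptions, not the paper's formulation.

```python
import torch

def egrsd_token_weights(rewards, teacher_logprobs, student_logprobs,
                        teacher_entropy, gate_floor=0.1, entropy_scale=1.0):
    """Per-token weights combining EGRSD's three signals (all shapes: (seq_len,)).

    Hypothetical sketch: the gate form and hyperparameters are assumptions.
    """
    # 1) Reward-grounded direction: the sign of the (baseline-adjusted)
    #    reward decides whether a token's likelihood is pushed up or down.
    direction = torch.sign(rewards)

    # 2) Likelihood-ratio magnitude: how far the student's probability
    #    lags the teacher's at this position (treated as a constant).
    magnitude = torch.exp(teacher_logprobs - student_logprobs).detach()

    # 3) Entropy confidence gate: down-weight tokens where the teacher is
    #    uncertain, but never below a nonzero floor so no position is
    #    silenced entirely.
    gate = torch.clamp(torch.exp(-entropy_scale * teacher_entropy),
                       min=gate_floor)

    return direction * magnitude * gate

def egrsd_loss(student_logprobs, weights):
    # Weighted negative log-likelihood on the student's own rollout
    # (on-policy); weights enter as constants, not through the graph.
    return -(weights.detach() * student_logprobs).mean()
```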
Key facts
- EGRSD stands for Entropy-Guided Reinforced Self-Distillation.
- It addresses uniform weighting of teacher token-level signals in on-policy self-distillation.
- The method uses three signals: reward-grounded direction, likelihood-ratio magnitude, and entropy confidence gate.
- The entropy gate down-weights high-entropy token positions with a nonzero lower bound.
- CL-EGRSD is a causal-lookahead variant that distinguishes sustained high-entropy spans from transient ones (see the sketch after this list).
- The paper is available on arXiv with ID 2605.13255.
- The approach targets efficient LLM reasoning.
- Experiments are conducted to validate the method.
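CL-EGRSD's lookahead is described only at a high level. One plausible reading, sketched below under the assumption of a fixed lookahead window and a simple persistence test (the `threshold`, `window`, and `relaxed_floor` parameters are all hypothetical), is to gate a position less harshly when its high entropy is a transient spike rather than part of a sustained span:

```python
import torch

def cl_egrsd_gate(teacher_entropy, threshold=2.0, window=4,
                  gate_floor=0.1, relaxed_floor=0.5, entropy_scale=1.0):
    """Causal-lookahead entropy gate, shape (seq_len,).

    Illustrative sketch: the sustained/transient test and all
    hyperparameters are assumptions; the paper's criterion may differ.
    """
    base_gate = torch.clamp(torch.exp(-entropy_scale * teacher_entropy),
                            min=gate_floor)
    gate = base_gate.clone()
    for t in range(teacher_entropy.shape[0]):
        if teacher_entropy[t] <= threshold:
            continue  # confident token: the plain gate applies
        ahead = teacher_entropy[t:t + window]
        if torch.all(ahead > threshold):
            continue  # sustained high-entropy span: keep strong down-weighting
        # Transient spike: entropy recovers within the window, so this
        # position is penalized less harshly via a higher floor.
        gate[t] = torch.clamp(base_gate[t], min=relaxed_floor)
    return gate
```

Lookahead is available despite the "causal" framing because distillation operates on a fully generated rollout, so tokens after position t already exist at training time.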
Entities
Institutions
- arXiv