EGRSD: Entropy-Guided Reinforced Self-Distillation for Efficient LLM Reasoning
EGRSD (Entropy-Guided Reinforced Self-Distillation) improves on-policy self-distillation for training reasoning models. Existing methods weight the teacher's token-level signals uniformly across a chain-of-thought sequence, ignoring variation in the teacher's predictive entropy. EGRSD instead composes each token-level update from three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and a teacher-entropy confidence gate that down-weights high-entropy positions while keeping a nonzero lower bound. A causal-lookahead variant, CL-EGRSD, further distinguishes sustained high-entropy spans from transient ones. Experiments reported in the paper validate the method.
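The paper's exact update rule is not given here, but a minimal sketch of how the three signals might combine into per-token weights could look like the following. The exponential gate form, the clamped floor, and all tensor and parameter names (`gate_floor`, `entropy_scale`) are illustrative assumptions, not the paper's formulation.

```python
import torch

def egrsd_token_weights(rewards, teacher_logprobs, student_logprobs,
                        teacher_entropy, gate_floor=0.1, entropy_scale=1.0):
    """Per-token weights combining EGRSD's three signals (all shapes: (seq_len,)).

    Hypothetical sketch: the gate form and hyperparameters are assumptions.
    """
    # 1) Reward-grounded direction: the sign of the (baseline-adjusted)
    #    reward decides whether a token's likelihood is pushed up or down.
    direction = torch.sign(rewards)

    # 2) Likelihood-ratio magnitude: how far the student's probability
    #    lags the teacher's at this position (treated as a constant).
    magnitude = torch.exp(teacher_logprobs - student_logprobs).detach()

    # 3) Entropy confidence gate: down-weight tokens where the teacher is
    #    uncertain, but never below a nonzero floor so no position is
    #    silenced entirely.
    gate = torch.clamp(torch.exp(-entropy_scale * teacher_entropy),
                       min=gate_floor)

    return direction * magnitude * gate

def egrsd_loss(student_logprobs, weights):
    # Weighted negative log-likelihood on the student's own rollout
    # (on-policy); weights enter as constants, not through the graph.
    return -(weights.detach() * student_logprobs).mean()
```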
Key facts
- EGRSD stands for Entropy-Guided Reinforced Self-Distillation.
- It addresses uniform weighting of teacher token-level signals in on-policy self-distillation.
- The method uses three signals: reward-grounded direction, likelihood-ratio magnitude, and entropy confidence gate.
- The entropy gate down-weights high-entropy token positions with a nonzero lower bound.
- CL-EGRSD is a causal-lookahead variant that distinguishes sustained high-entropy spans from transient ones (see the sketch after this list).
- The paper is available on arXiv with ID 2605.13255.
- The approach targets efficient LLM reasoning.
- Experiments are conducted to validate the method.
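CL-EGRSD's lookahead is described only at a high level. One plausible reading, sketched below under the assumption of a fixed lookahead window and a simple persistence test (the `threshold`, `window`, and `relaxed_floor` parameters are all hypothetical), is to gate a position less harshly when its high entropy is a transient spike rather than part of a sustained span:

```python
import torch

def cl_egrsd_gate(teacher_entropy, threshold=2.0, window=4,
                  gate_floor=0.1, relaxed_floor=0.5, entropy_scale=1.0):
    """Causal-lookahead entropy gate, shape (seq_len,).

    Illustrative sketch: the sustained/transient test and all
    hyperparameters are assumptions; the paper's criterion may differ.
    """
    base_gate = torch.clamp(torch.exp(-entropy_scale * teacher_entropy),
                            min=gate_floor)
    gate = base_gate.clone()
    for t in range(teacher_entropy.shape[0]):
        if teacher_entropy[t] <= threshold:
            continue  # confident token: the plain gate applies
        ahead = teacher_entropy[t:t + window]
        if torch.all(ahead > threshold):
            continue  # sustained high-entropy span: keep strong down-weighting
        # Transient spike: entropy recovers within the window, so this
        # position is penalized less harshly via a higher floor.
        gate[t] = torch.clamp(base_gate[t], min=relaxed_floor)
    return gate
```

Lookahead is available despite the "causal" framing because distillation operates on a fully generated rollout, so tokens after position t already exist at training time.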
Entities
Institutions
- arXiv