Adaptive Teacher Exposure Improves LLM Reasoning Self-Distillation
A new paper on arXiv (2605.11458) challenges the default practice in on-policy self-distillation for large language model (LLM) reasoning, in which the teacher model is always conditioned on the full reference reasoning. The authors identify a "teacher-side exposure mismatch": conditioning the teacher on reasoning beyond the student's current competence produces distillation targets too difficult for the student to learn from. A controlled sweep over fixed exposure levels shows that full exposure is not always optimal and that the mismatch grows as the teacher sees more privileged reasoning. The authors propose Adaptive Teacher Exposure, which treats the teacher's exposure level as a learnable training-time variable rather than a fixed setting. The method is evaluated on mathematical reasoning benchmarks, where it improves student performance. The work was submitted on May 26, 2025.
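To make the exposure idea concrete, here is a minimal sketch of fixed-exposure truncation plus a simple adaptive rule. This is illustrative only: the function names, the success-rate heuristic, and the linear mapping are assumptions for exposition, not the paper's learned mechanism (the paper treats exposure as a learnable variable, not a hand-set schedule).

```python
def truncate_reasoning(reference_steps, exposure):
    """Keep only the first `exposure` fraction of the reference reasoning.

    This models the teacher being conditioned on partial, rather than
    full, privileged reasoning (hypothetical helper, not from the paper).
    """
    keep = int(len(reference_steps) * exposure)
    return reference_steps[:keep]


def adaptive_exposure(student_success_rate, min_exposure=0.0, max_exposure=1.0):
    """Map the student's recent success rate to a teacher exposure level.

    Illustrative heuristic: a weaker student gets a teacher conditioned on
    less privileged reasoning, keeping distillation targets within reach.
    """
    return min_exposure + (max_exposure - min_exposure) * student_success_rate


# Toy usage: a struggling student (30% success) yields partial exposure,
# so the teacher sees only the earliest reference steps.
reference = ["step1", "step2", "step3", "step4", "step5"]
e = adaptive_exposure(0.3)
visible = truncate_reasoning(reference, e)
print(visible)
```

In a real training loop, `exposure` would instead be produced and updated by the learned training-time mechanism the paper proposes; the sketch only shows where such a value would plug in.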
Key facts
- Paper on arXiv: 2605.11458
- Submitted May 26, 2025
- Focuses on on-policy self-distillation for LLM reasoning
- Identifies teacher-side exposure mismatch
- Full exposure is not always the best choice
- Student-teacher mismatch grows with more privileged reasoning
- Proposes Adaptive Teacher Exposure as a learnable control variable
- Evaluated on mathematical reasoning benchmarks