Adaptive Teacher Exposure Improves LLM Reasoning Self-Distillation
A new paper on arXiv (2605.11458) challenges the default practice in on-policy self-distillation for large language model (LLM) reasoning, in which the teacher model is always conditioned on the full reference reasoning. The authors identify a "teacher-side exposure mismatch": conditioning the teacher on reasoning beyond the student's current competence produces distillation targets too difficult for the student to learn from. A controlled sweep over fixed exposure levels shows that full exposure is not always optimal and that the mismatch grows as the teacher sees more privileged reasoning. The authors propose Adaptive Teacher Exposure, which treats the teacher's exposure level as a learnable training-time variable rather than a fixed setting. The method is evaluated on mathematical reasoning benchmarks, where it improves student performance. The work was submitted on May 26, 2025.
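To make the exposure idea concrete, here is a minimal sketch of fixed-exposure truncation plus a simple adaptive rule. This is illustrative only: the function names, the success-rate heuristic, and the linear mapping are assumptions for exposition, not the paper's learned mechanism (the paper treats exposure as a learnable variable, not a hand-set schedule).

```python
def truncate_reasoning(reference_steps, exposure):
    """Keep only the first `exposure` fraction of the reference reasoning.

    This models the teacher being conditioned on partial, rather than
    full, privileged reasoning (hypothetical helper, not from the paper).
    """
    keep = int(len(reference_steps) * exposure)
    return reference_steps[:keep]


def adaptive_exposure(student_success_rate, min_exposure=0.0, max_exposure=1.0):
    """Map the student's recent success rate to a teacher exposure level.

    Illustrative heuristic: a weaker student gets a teacher conditioned on
    less privileged reasoning, keeping distillation targets within reach.
    """
    return min_exposure + (max_exposure - min_exposure) * student_success_rate


# Toy usage: a struggling student (30% success) yields partial exposure,
# so the teacher sees only the earliest reference steps.
reference = ["step1", "step2", "step3", "step4", "step5"]
e = adaptive_exposure(0.3)
visible = truncate_reasoning(reference, e)
print(visible)
```

In a real training loop, `exposure` would instead be produced and updated by the learned training-time mechanism the paper proposes; the sketch only shows where such a value would plug in.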
Key facts
- Paper on arXiv: 2605.11458
- Submitted May 26, 2025
- Focuses on on-policy self-distillation for LLM reasoning
- Identifies teacher-side exposure mismatch
- Full exposure is not always the best choice
- Student-teacher mismatch grows with more privileged reasoning
- Proposes Adaptive Teacher Exposure as a learnable control variable
- Evaluated on mathematical reasoning benchmarks