Mitigating Dual Exposure Biases in LLM Reasoning Distillation
A new arXiv paper (2605.19433) identifies a fundamental dilemma in LLM reasoning distillation. Off-policy distillation trains the student on teacher-generated trajectories, which causes exposure bias: the contexts seen in training do not match those the student meets at inference. On-policy distillation trains on student-generated trajectories, which introduces a reciprocal, reversed exposure bias: the teacher model struggles with student-generated contexts. The authors propose a method to mitigate both biases.
Key facts
- arXiv paper 2605.19433
- Addresses exposure biases in LLM reasoning distillation
- Off-policy distillation uses teacher-generated trajectories
- On-policy distillation uses student-generated trajectories
- Each approach suffers from its own bias: exposure bias (off-policy) and reversed exposure bias (on-policy)
- Proposes a method to mitigate both exposure biases
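
The difference between the two settings can be made concrete with a toy sketch. This is not the paper's method: the policies are random stand-ins, and the names `make_policy`, `sample_trajectory`, and `distill_loss`, as well as the choice of forward KL for the off-policy case and reverse KL for the on-policy case, are assumptions made for illustration only. The point it shows is where the contexts come from in each setting.

```python
import math
import random

VOCAB = ["a", "b", "c"]  # toy vocabulary (assumption for illustration)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_policy(name):
    """Hypothetical stand-in for a model: maps a prefix to next-token probabilities."""
    def policy(prefix):
        # Seed on the model name and prefix so the same prefix always
        # yields the same distribution for a given model.
        rng = random.Random(f"{name}:{''.join(prefix)}")
        return softmax([rng.uniform(-1.0, 1.0) for _ in VOCAB])
    return policy


def sample_trajectory(policy, length, rng):
    """Roll out `length` tokens by sampling from `policy` step by step."""
    prefix = ()
    for _ in range(length):
        token = rng.choices(VOCAB, weights=policy(prefix))[0]
        prefix += (token,)
    return prefix


def kl(p, q):
    """KL(p || q) for two categorical distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def distill_loss(trajectory, teacher, student, reverse=False):
    """Mean per-step divergence, evaluated on the prefixes of `trajectory`."""
    total = 0.0
    for t in range(len(trajectory)):
        prefix = trajectory[:t]
        p_teacher, p_student = teacher(prefix), student(prefix)
        # Forward KL(teacher || student) or reverse KL(student || teacher).
        total += kl(p_student, p_teacher) if reverse else kl(p_teacher, p_student)
    return total / len(trajectory)


teacher = make_policy("teacher")
student = make_policy("student")
rng = random.Random(0)

# Off-policy: contexts come from the teacher, so the student never trains
# on the prefixes it will actually produce at inference time.
off_traj = sample_trajectory(teacher, 5, rng)
off_policy_loss = distill_loss(off_traj, teacher, student)

# On-policy: contexts come from the student, so the teacher is queried on
# prefixes it may never have produced itself.
on_traj = sample_trajectory(student, 5, rng)
on_policy_loss = distill_loss(on_traj, teacher, student, reverse=True)

print("off-policy trajectory:", off_traj, "loss:", off_policy_loss)
print("on-policy trajectory: ", on_traj, "loss:", on_policy_loss)
```

In both cases the loss compares the same two models; only the source of the prefixes changes, which is where the two biases described above originate.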