Mitigating Dual Exposure Biases in LLM Reasoning Distillation

ai-technology · 2026-05-20

A new arXiv paper (2605.19433) identifies a fundamental dilemma in LLM reasoning distillation. Off-policy distillation causes exposure bias: the student trains on contexts it did not generate, then must condition on its own outputs at inference. On-policy distillation removes that training-inference mismatch but introduces a reciprocal, reversed exposure bias, in which the teacher model struggles with student-generated contexts. The authors propose a method to mitigate both biases.
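To make the two regimes concrete, here is a minimal sketch that is not the paper's method. It assumes two hypothetical toy bigram models (`TEACHER`, `STUDENT`) over a three-token vocabulary and a per-token KL distillation loss. The only thing that changes between the two calls is which model generates the contexts being scored; all names and probabilities are illustrative assumptions.

```python
import math
import random

VOCAB = ["a", "b", "c"]

# Hypothetical toy models: next-token probabilities given only the previous token.
TEACHER = {
    "<s>": [0.8, 0.1, 0.1],
    "a": [0.1, 0.8, 0.1],
    "b": [0.1, 0.1, 0.8],
    "c": [0.8, 0.1, 0.1],
}
STUDENT = {
    "<s>": [0.5, 0.3, 0.2],
    "a": [0.3, 0.5, 0.2],
    "b": [0.3, 0.3, 0.4],
    "c": [0.4, 0.3, 0.3],
}


def next_dist(table, prefix):
    """Next-token distribution for a prefix (depends on the last token only)."""
    prev = prefix[-1] if prefix else "<s>"
    return table[prev]


def sample(table, length, rng):
    """Generate a sequence of `length` tokens from the given model."""
    prefix = []
    for _ in range(length):
        probs = next_dist(table, prefix)
        prefix.append(rng.choices(VOCAB, weights=probs)[0])
    return prefix


def kl(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(teacher, student, prefix_source, length, n, rng):
    """Mean per-token KL(teacher || student) on contexts drawn from prefix_source."""
    total = 0.0
    for _ in range(n):
        seq = sample(prefix_source, length, rng)
        for t in range(length):
            context = seq[:t]
            total += kl(next_dist(teacher, context), next_dist(student, context))
    return total / (n * length)


rng = random.Random(0)
# Off-policy: contexts come from the teacher, so the student never
# trains on the prefixes it will produce itself at inference time.
off_policy_loss = distill_loss(TEACHER, STUDENT, TEACHER, 6, 200, rng)
# On-policy: contexts come from the student, so the teacher must
# score prefixes it would rarely have generated.
on_policy_loss = distill_loss(TEACHER, STUDENT, STUDENT, 6, 200, rng)
print(f"off-policy loss: {off_policy_loss:.3f}")
print(f"on-policy loss:  {on_policy_loss:.3f}")
```

The sketch only shows where the contexts come from in each regime. It keeps the same divergence in both calls for simplicity, and because the toy teacher is a fixed lookup table, it does not model the teacher's quality degrading on unfamiliar student contexts, which is the effect the paper describes.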