ARTFEED — Contemporary Art Intelligence

Mitigating Dual Exposure Biases in LLM Reasoning Distillation

ai-technology · 2026-05-20

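A new arXiv paper (2605.19433) identifies a fundamental dilemma in LLM reasoning distillation. Off-policy distillation trains the student on teacher-generated trajectories, which causes exposure bias: the contexts seen in training do not match the contexts the student produces at inference. On-policy distillation trains on student-generated trajectories instead, but this introduces a reciprocal, reversed exposure bias, in which the teacher model struggles with student-generated contexts. The authors propose a method to mitigate both biases.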

Key facts

  • arXiv paper 2605.19433
  • Addresses exposure biases in LLM reasoning distillation
  • Off-policy distillation uses teacher-generated trajectories
  • On-policy distillation uses student-generated trajectories
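  • Off-policy distillation suffers from exposure bias (training-inference mismatch)
  • On-policy distillation suffers from reversed exposure bias (the teacher struggles with student-generated contexts)
  • Proposes a method to mitigate both exposure biases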
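
The difference between the two regimes comes down to which model generates the contexts on which the teacher's and student's next-token distributions are compared. The following is a minimal toy sketch of that distinction only, not the paper's method: the `TEACHER` and `STUDENT` tables, the three-token vocabulary, and all function names are hypothetical, and a forward KL loss is assumed for both regimes so that the only thing that changes is the source of the trajectory.

```python
import math
import random

VOCAB = [0, 1, 2]

# Hypothetical toy "models": next-token distributions keyed by the previous
# token (None marks the start of a sequence). Not taken from the paper.
TEACHER = {None: [0.8, 0.1, 0.1], 0: [0.1, 0.8, 0.1],
           1: [0.1, 0.1, 0.8], 2: [0.8, 0.1, 0.1]}
STUDENT = {None: [0.4, 0.3, 0.3], 0: [0.3, 0.4, 0.3],
           1: [0.3, 0.3, 0.4], 2: [0.4, 0.3, 0.3]}


def sample_trajectory(model, length, rng):
    """Sample `length` tokens autoregressively from `model`."""
    tokens, last = [], None
    for _ in range(length):
        last = rng.choices(VOCAB, weights=model[last])[0]
        tokens.append(last)
    return tokens


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher, student, trajectory):
    """Mean per-step KL(teacher || student) over the contexts of one trajectory."""
    total, last = 0.0, None
    for token in trajectory:
        total += kl(teacher[last], student[last])
        last = token  # the next context is whatever the trajectory contains
    return total / len(trajectory)


def off_policy_loss(teacher, student, length, rng):
    # Contexts come from the teacher: the student never trains on its own outputs.
    trajectory = sample_trajectory(teacher, length, rng)
    return distillation_loss(teacher, student, trajectory), trajectory


def on_policy_loss(teacher, student, length, rng):
    # Contexts come from the student: the teacher is queried on prefixes
    # it may rarely produce itself.
    trajectory = sample_trajectory(student, length, rng)
    return distillation_loss(teacher, student, trajectory), trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    print("off-policy:", off_policy_loss(TEACHER, STUDENT, 8, rng))
    print("on-policy: ", on_policy_loss(TEACHER, STUDENT, 8, rng))
```

In this toy the teacher answers equally well on any context, so the sketch shows only where the contexts come from in each regime; it does not reproduce either bias or the mitigation the paper proposes.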
