Decoupling KL and Trajectories in LLM Distillation
A recent preprint posted to arXiv (2605.16826) studies knowledge distillation in the post-training phase of large language models (LLMs). It observes that the two common approaches, off-policy distillation and on-policy distillation (OPD), each tie together two separate choices: the prefix source (which model generates the prefixes being scored) and the direction of the token-level KL divergence. By decomposing the sequence-level KL between autoregressive response distributions, the authors show that sequence-level forward KL pairs teacher-generated prefixes with token-level forward KL, whereas sequence-level reverse KL pairs student-generated prefixes with token-level reverse KL. They argue that this pairing is not necessary and that decoupling the two axes yields four valid objectives. The resulting framework gives a unified view that relates off-policy SFT, DAgger, offline RL, and OPD.
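The decomposition mentioned above follows from the chain rule for KL divergence. As a sketch, in notation assumed here rather than taken from the paper (teacher $p_T$, student $p_S$, prompt $x$, and a response $y = (y_1, \dots, y_n)$ of fixed length $n$ for simplicity):

```latex
\begin{align}
\mathrm{KL}\big(p_T(\cdot \mid x) \,\|\, p_S(\cdot \mid x)\big)
  &= \sum_{t=1}^{n} \mathbb{E}_{y_{<t} \sim p_T(\cdot \mid x)}
     \Big[ \mathrm{KL}\big(p_T(\cdot \mid x, y_{<t}) \,\|\, p_S(\cdot \mid x, y_{<t})\big) \Big], \\
\mathrm{KL}\big(p_S(\cdot \mid x) \,\|\, p_T(\cdot \mid x)\big)
  &= \sum_{t=1}^{n} \mathbb{E}_{y_{<t} \sim p_S(\cdot \mid x)}
     \Big[ \mathrm{KL}\big(p_S(\cdot \mid x, y_{<t}) \,\|\, p_T(\cdot \mid x, y_{<t})\big) \Big].
\end{align}
```

In each line the model that appears first in the sequence-level KL both samples the prefix $y_{<t}$ and appears first in the inner token-level KL. Choosing the prefix distribution and the inner KL direction independently gives the four combinations.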
Key facts
- Paper arXiv:2605.16826 analyzes LLM distillation
- Off-policy and on-policy distillation couple prefix source and token-level KL direction
- Decoupling yields four valid objectives
- Forward KL gives SFT-style cross-entropy matching
- Reverse KL gives an RL-style policy-gradient objective
- Connects these objectives to off-policy SFT, DAgger, offline RL, and OPD
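To make the four combinations concrete, here is a minimal sketch that evaluates each objective exactly on a toy problem. It is not the paper's implementation: the bigram "teacher" and "student" models, the exhaustive enumeration of prefixes, and all function names are illustrative assumptions, and the code computes objective values only, not gradients.

```python
import itertools
import numpy as np


def random_bigram_model(rng, vocab):
    """Toy autoregressive model (hypothetical): initial distribution plus transition table."""
    init = rng.dirichlet(np.ones(vocab))
    trans = rng.dirichlet(np.ones(vocab), size=vocab)  # trans[prev] = p(next | prev)
    return init, trans


def next_dist(model, prefix):
    """Next-token distribution given a prefix (depends only on the last token)."""
    init, trans = model
    return init if len(prefix) == 0 else trans[prefix[-1]]


def seq_prob(model, seq):
    """Probability the model assigns to a token sequence; 1.0 for the empty sequence."""
    p = 1.0
    for t in range(len(seq)):
        p *= next_dist(model, seq[:t])[seq[t]]
    return p


def token_kl(p, q):
    """KL(p || q) between two next-token distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def objective(teacher, student, length, vocab, prefix_source, direction):
    """Expected sum over positions of token-level KL.

    prefix_source: "teacher" or "student" (which model the prefixes are drawn from).
    direction: "forward" for KL(teacher || student), "reverse" for KL(student || teacher).
    """
    sampler = teacher if prefix_source == "teacher" else student
    total = 0.0
    for t in range(length):
        for prefix in itertools.product(range(vocab), repeat=t):
            weight = seq_prob(sampler, prefix)  # probability of this prefix under the sampler
            p_t = next_dist(teacher, prefix)
            p_s = next_dist(student, prefix)
            kl = token_kl(p_t, p_s) if direction == "forward" else token_kl(p_s, p_t)
            total += weight * kl
    return total


def sequence_kl(p_model, q_model, length, vocab):
    """Sequence-level KL(p || q), computed by enumerating every full sequence."""
    total = 0.0
    for seq in itertools.product(range(vocab), repeat=length):
        p = seq_prob(p_model, seq)
        q = seq_prob(q_model, seq)
        total += p * (np.log(p) - np.log(q))
    return float(total)


rng = np.random.default_rng(0)
teacher = random_bigram_model(rng, vocab=3)
student = random_bigram_model(rng, vocab=3)

# Print all four (prefix source, KL direction) combinations.
for source in ("teacher", "student"):
    for direction in ("forward", "reverse"):
        value = objective(teacher, student, 3, 3, source, direction)
        print(f"{source} prefixes, {direction} token KL: {value:.4f}")
```

In this toy setting the two coupled combinations (teacher prefixes with forward KL, student prefixes with reverse KL) coincide with the corresponding sequence-level KL, which `sequence_kl` can be used to check; the two mixed combinations are well defined but do not correspond to either sequence-level KL in general.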
Entities
Institutions
- arXiv