Decoupling KL and Trajectories in LLM Distillation

other · 2026-05-20

A recent study on arXiv (2605.16826) examines knowledge distillation in the post-training phase of large language models (LLMs). It observes that the common approaches, off-policy distillation and on-policy distillation (OPD), each tie together two separate choices: the source of the prefixes and the direction of the token-level KL divergence. By decomposing sequence-level KL over autoregressive response distributions, the researchers show that sequence-level forward KL pairs teacher-generated prefixes with token-level forward KL, while sequence-level reverse KL pairs student-generated prefixes with token-level reverse KL. They argue that this coupling is not essential and that separating the two dimensions yields four valid objectives. The paper relates these objectives to off-policy SFT, DAgger, offline RL, and OPD.
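The decomposition described above follows from the chain rule for KL divergence. As a sketch in notation assumed here rather than taken from the paper (teacher \pi_T, student \pi_S, prompt x, response y = (y_1, ..., y_n) of fixed length n), the two sequence-level divergences can be written as:

```latex
% Sequence-level forward KL: teacher-sampled prefixes, token-level forward KL
\mathrm{KL}\bigl(\pi_T(\cdot \mid x) \,\|\, \pi_S(\cdot \mid x)\bigr)
  = \sum_{t=1}^{n} \mathbb{E}_{y_{<t} \sim \pi_T(\cdot \mid x)}
    \Bigl[ \mathrm{KL}\bigl(\pi_T(\cdot \mid x, y_{<t}) \,\|\, \pi_S(\cdot \mid x, y_{<t})\bigr) \Bigr]

% Sequence-level reverse KL: student-sampled prefixes, token-level reverse KL
\mathrm{KL}\bigl(\pi_S(\cdot \mid x) \,\|\, \pi_T(\cdot \mid x)\bigr)
  = \sum_{t=1}^{n} \mathbb{E}_{y_{<t} \sim \pi_S(\cdot \mid x)}
    \Bigl[ \mathrm{KL}\bigl(\pi_S(\cdot \mid x, y_{<t}) \,\|\, \pi_T(\cdot \mid x, y_{<t})\bigr) \Bigr]
```

In each identity the distribution that samples the prefix and the first argument of the token-level KL are the same model. Swapping only one of the two gives the remaining two combinations, which no longer equal a sequence-level KL but are still well-defined sums of non-negative terms.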

Key facts

Paper arXiv:2605.16826 analyzes LLM distillation
Off-policy and on-policy distillation couple prefix source and token-level KL direction
Decoupling yields four valid objectives
Forward KL gives SFT-style cross-entropy matching
Reverse KL gives RL-style policy-gradient objective
Connects to off-policy SFT, DAgger, offline RL, OPD
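To make the four combinations concrete, here is a minimal sketch on a toy problem. It is not the paper's implementation: the two-token vocabulary, the hand-made `make_model` function, and the exact enumeration of prefixes (instead of sampling) are all assumptions chosen so that the quantities can be computed in closed form.

```python
import itertools
import math

VOCAB = (0, 1)   # toy two-token vocabulary (assumption)
LENGTH = 3       # fixed response length (assumption)

def make_model(bias):
    """Hypothetical autoregressive model: P(next=1) depends on the prefix sum."""
    def next_dist(prefix):
        logit = bias + 0.5 * sum(prefix)
        p1 = 1.0 / (1.0 + math.exp(-logit))
        return (1.0 - p1, p1)
    return next_dist

def seq_prob(model, seq):
    """Probability of a token sequence (or prefix) under an autoregressive model."""
    prob = 1.0
    for t, tok in enumerate(seq):
        prob *= model(seq[:t])[tok]
    return prob

def token_kl(p, q):
    """KL(p || q) between two next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(teacher, student, prefix_source, direction):
    """Sum over positions of the expected token-level KL.

    prefix_source: "teacher" or "student" (which model generates prefixes)
    direction: "forward" = KL(teacher || student), "reverse" = KL(student || teacher)
    """
    sampler = teacher if prefix_source == "teacher" else student
    total = 0.0
    for t in range(LENGTH):
        for prefix in itertools.product(VOCAB, repeat=t):
            weight = seq_prob(sampler, prefix)  # exact prefix probability
            p_t, p_s = teacher(prefix), student(prefix)
            kl = token_kl(p_t, p_s) if direction == "forward" else token_kl(p_s, p_t)
            total += weight * kl
    return total

def sequence_kl(p_model, q_model):
    """Exact sequence-level KL(p || q) by enumerating all full responses."""
    total = 0.0
    for seq in itertools.product(VOCAB, repeat=LENGTH):
        p = seq_prob(p_model, seq)
        total += p * math.log(p / seq_prob(q_model, seq))
    return total

teacher = make_model(0.3)
student = make_model(-0.4)

for source in ("teacher", "student"):
    for direction in ("forward", "reverse"):
        print(source, direction, objective(teacher, student, source, direction))
```

On this toy problem the (teacher, forward) value matches `sequence_kl(teacher, student)` and the (student, reverse) value matches `sequence_kl(student, teacher)`, which are the two coupled cases. The other two combinations are the decoupled objectives. The sketch only evaluates the objectives; how to estimate their gradients with sampled prefixes is left out.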
