TCOD Framework Improves Multi-Turn Agent Distillation
Researchers identify Trajectory-Level KL Instability in on-policy distillation for multi-turn autonomous agents: KL divergence rises as the success rate drops and remains high even after convergence, driven by inter-turn error compounding. They propose TCOD (Temporal Curriculum On-Policy Distillation), which controls the trajectory depth exposed to the student and progressively expands it, stabilizing training. The work is published on arXiv (2604.24005).
Key facts
- On-policy distillation (OPD) transfers reasoning ability from large teacher models to smaller student models.
- Vanilla OPD faces Trajectory-Level KL Instability in multi-turn settings.
- KL divergence rises as the success rate drops and remains high after convergence.
- Instability arises from inter-turn error compounding.
- TCOD controls trajectory depth and progressively expands it.
- TCOD stands for Temporal Curriculum On-Policy Distillation.
- The paper is on arXiv with ID 2604.24005.
- The arXiv announcement type is cross (a cross-listing from another category).
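The temporal-curriculum idea described above can be sketched in a few lines: cap the number of turns whose KL loss the student sees, then widen that cap over training. This is a minimal illustrative sketch; the function names (`curriculum_depth`, `tcod_loss`), the linear schedule, and the per-turn averaging are assumptions for illustration, not the paper's actual implementation.

```python
import math

def kl_divergence(p, q):
    """Forward KL(p || q) between two discrete distributions (lists of probs)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def curriculum_depth(step, total_steps, max_depth, min_depth=1):
    """Linearly expand the trajectory depth exposed to the student.

    Early in training only the first few turns contribute to the loss;
    by the end, all `max_depth` turns do. (Linear schedule is an assumption.)
    """
    frac = min(1.0, step / max(1, total_steps))
    return min_depth + int(frac * (max_depth - min_depth))

def tcod_loss(teacher_turns, student_turns, depth):
    """Average per-turn KL over only the first `depth` turns of a trajectory.

    Each element of `teacher_turns` / `student_turns` is a toy per-turn
    action distribution; truncating at `depth` keeps early training from
    being dominated by compounded late-turn errors.
    """
    turns = list(zip(teacher_turns, student_turns))[:depth]
    return sum(kl_divergence(t, s) for t, s in turns) / len(turns)

# Toy trajectory: teacher is confident, student starts near-uniform.
teacher = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
student = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

d_early = curriculum_depth(step=0, total_steps=100, max_depth=3)    # -> 1
d_late = curriculum_depth(step=100, total_steps=100, max_depth=3)   # -> 3
loss_early = tcod_loss(teacher, student, d_early)  # KL on first turn only
loss_late = tcod_loss(teacher, student, d_late)    # KL over all three turns
```

The design point this sketch makes concrete is that the curriculum changes only *which* turns are scored, not the loss itself, so the same distillation objective is reused at every stage.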
Entities
Institutions
- arXiv