TCOD Framework Improves Multi-Turn Agent Distillation

ai-technology · 2026-04-29

Researchers identify Trajectory-Level KL Instability in on-policy distillation for multi-turn autonomous agents: the KL divergence between student and teacher rises as the success rate drops, and it remains high even after training converges, driven by inter-turn error compounding. They propose TCOD (Temporal Curriculum On-Policy Distillation), which controls the trajectory depth the student is exposed to and progressively expands it, stabilizing training. The work is posted on arXiv (2604.24005).
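
The objective behind on-policy distillation is commonly a per-token reverse KL between student and teacher, scored on trajectories the student itself samples. The sketch below illustrates that objective under this assumption, in PyTorch; the function and tensor names (opd_token_loss, student_logits, teacher_logits) are hypothetical, and the paper's exact loss may differ.

    import torch
    import torch.nn.functional as F

    def opd_token_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor) -> torch.Tensor:
        """Reverse KL(student || teacher), averaged over trajectory tokens.

        Both inputs have shape (seq_len, vocab_size). In on-policy
        distillation the trajectory is sampled from the *student*; the
        teacher only scores the student's own tokens.
        """
        log_s = F.log_softmax(student_logits, dim=-1)
        log_t = F.log_softmax(teacher_logits, dim=-1)
        # per position: sum_v p_s(v) * (log p_s(v) - log p_t(v))
        kl_per_token = (log_s.exp() * (log_s - log_t)).sum(dim=-1)
        return kl_per_token.mean()

    # toy check: identical distributions give (near-)zero divergence
    logits = torch.randn(16, 100)
    assert opd_token_loss(logits, logits).abs().item() < 1e-5

Tracking kl_per_token by turn position is what exposes the instability the paper describes: under vanilla OPD in multi-turn settings, the KL at later turns climbs as errors compound from one turn to the next.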

Key facts

  • On-policy distillation (OPD) transfers reasoning from large models to smaller students.
  • Vanilla OPD faces Trajectory-Level KL Instability in multi-turn settings.
  • KL divergence rises as the success rate drops and remains high even after convergence.
  • Instability arises from inter-turn error compounding.
  • TCOD controls trajectory depth and progressively expands it (see the sketch following this list).
  • TCOD stands for Temporal Curriculum On-Policy Distillation.
  • The paper is on arXiv with ID 2604.24005.
  • The arXiv announcement type is cross, i.e., a cross-listing.
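
As referenced above, TCOD's central move is to cap the trajectory depth, the number of agent turns that contribute to the distillation loss, and to expand that cap over training. Below is a minimal sketch assuming a linear expansion schedule and a token-level turn mask; the schedule shape and all names (tcod_max_turns, turn_mask) are illustrative assumptions, not the paper's API.

    import torch

    def tcod_max_turns(step: int, total_steps: int,
                       max_turns: int, min_turns: int = 1) -> int:
        """Curriculum cap on trajectory depth at a given training step.

        Early steps expose the student only to shallow (few-turn)
        prefixes, limiting inter-turn error compounding; the cap grows
        toward the full horizon as training proceeds. A linear schedule
        is assumed purely for illustration.
        """
        frac = min(step / max(total_steps, 1), 1.0)
        return min_turns + round(frac * (max_turns - min_turns))

    def turn_mask(turn_ids: torch.Tensor, allowed_turns: int) -> torch.Tensor:
        """Boolean mask over tokens: True where the token's 0-based turn
        index falls inside the current curriculum cap."""
        return turn_ids < allowed_turns

    # usage with the per-token KL from the earlier sketch:
    #   mask = turn_mask(turn_ids, tcod_max_turns(step, total_steps, horizon))
    #   loss = (kl_per_token * mask).sum() / mask.sum().clamp(min=1)

Only the masked, in-curriculum turns receive gradient, so the student is never distilled on deeper turns than the curriculum currently allows.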

Entities

Institutions

  • arXiv

Sources

  • arXiv:2604.24005