TCOD Framework Improves Multi-Turn Agent Distillation
Researchers identify Trajectory-Level KL Instability in on-policy distillation for multi-turn autonomous agents: KL divergence rises as the success rate drops and remains high even after convergence, driven by inter-turn error compounding. They propose TCOD (Temporal Curriculum On-Policy Distillation), which controls the trajectory depth exposed to the student and progressively expands it, stabilizing training. The work is published on arXiv (2604.24005).
Key facts
- On-policy distillation (OPD) transfers reasoning ability from large teacher models to smaller student models.
- Vanilla OPD faces Trajectory-Level KL Instability in multi-turn settings.
- KL divergence rises as the success rate drops and remains high after convergence.
- Instability arises from inter-turn error compounding.
- TCOD controls trajectory depth and progressively expands it.
- TCOD stands for Temporal Curriculum On-Policy Distillation.
- The paper is on arXiv with ID 2604.24005.
- The arXiv announcement type is cross (a cross-listing from another category).
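The temporal-curriculum idea described above can be sketched in a few lines: cap the number of turns whose KL loss the student sees, then widen that cap over training. This is a minimal illustrative sketch; the function names (`curriculum_depth`, `tcod_loss`), the linear schedule, and the per-turn averaging are assumptions for illustration, not the paper's actual implementation.

```python
import math

def kl_divergence(p, q):
    """Forward KL(p || q) between two discrete distributions (lists of probs)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def curriculum_depth(step, total_steps, max_depth, min_depth=1):
    """Linearly expand the trajectory depth exposed to the student.

    Early in training only the first few turns contribute to the loss;
    by the end, all `max_depth` turns do. (Linear schedule is an assumption.)
    """
    frac = min(1.0, step / max(1, total_steps))
    return min_depth + int(frac * (max_depth - min_depth))

def tcod_loss(teacher_turns, student_turns, depth):
    """Average per-turn KL over only the first `depth` turns of a trajectory.

    Each element of `teacher_turns` / `student_turns` is a toy per-turn
    action distribution; truncating at `depth` keeps early training from
    being dominated by compounded late-turn errors.
    """
    turns = list(zip(teacher_turns, student_turns))[:depth]
    return sum(kl_divergence(t, s) for t, s in turns) / len(turns)

# Toy trajectory: teacher is confident, student starts near-uniform.
teacher = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
student = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

d_early = curriculum_depth(step=0, total_steps=100, max_depth=3)    # -> 1
d_late = curriculum_depth(step=100, total_steps=100, max_depth=3)   # -> 3
loss_early = tcod_loss(teacher, student, d_early)  # KL on first turn only
loss_late = tcod_loss(teacher, student, d_late)    # KL over all three turns
```

The design point this sketch makes concrete is that the curriculum changes only *which* turns are scored, not the loss itself, so the same distillation objective is reused at every stage.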
Entities
Institutions
- arXiv