Teacher-Guided Policy Optimization Improves LLM Distillation
The recently introduced Teacher-Guided Policy Optimization (TGPO) algorithm tackles a key weakness of reverse KL (RKL) distillation for large language models: when the student and teacher distributions diverge substantially, the RKL objective yields mostly negative, uninformative feedback and fails to improve the student. TGPO instead provides dense directional guidance by conditioning the teacher's predictions on the student's own rollout, while remaining on-policy and integrating with existing reinforcement-learning-with-verifiable-rewards (RLVR) frameworks without any additional data annotation. Experiments on complex reasoning benchmarks show that TGPO substantially outperforms standard baselines and is robust to the choice of teacher model.
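For context, standard on-policy RKL distillation minimizes the reverse KL from the student policy $\pi_\theta$ to the teacher $\pi_T$ over the student's own samples (this is the standard formulation, not notation taken from the paper):

$$
\mathcal{L}_{\mathrm{RKL}}(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[\, \log \pi_\theta(y \mid x) - \log \pi_T(y \mid x) \,\right]
$$

When the student samples rollouts to which the teacher assigns near-zero probability, $\log \pi_T(y \mid x)$ is uniformly and strongly negative, so the objective mostly signals that the whole rollout is bad without indicating which tokens to change; this is the unhelpful negative feedback that TGPO replaces with dense, directional guidance.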
Key facts
- TGPO is an on-policy algorithm for LLM distillation.
- It addresses a limitation in Reverse KL (RKL) when student and teacher distributions diverge.
- TGPO provides dense directional guidance by conditioning teacher predictions on the student's rollout (see the sketch after this list).
- It integrates with existing RLVR frameworks without additional data annotation.
- Experiments on complex reasoning benchmarks show TGPO outperforms standard baselines.
- TGPO is robust to the choice of teacher model.
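As a rough illustration of what rollout-conditioned dense guidance could look like inside an RLVR-style training loop, here is a minimal PyTorch sketch. The function name, the use of a per-token teacher-vs-student log-probability gap as the advantage, and the plain policy-gradient loss are all assumptions made for illustration, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def dense_teacher_guidance(student_logits, teacher_logits, rollout_ids):
    """Per-token guidance from a teacher, conditioned on the student's own
    rollout. A hedged sketch of the general idea, not TGPO's exact objective.

    student_logits, teacher_logits: [batch, seq_len, vocab] logits that each
        model assigns at every position of the student-generated rollout.
    rollout_ids: [batch, seq_len] token ids sampled by the student (on-policy).
    """
    # Log-probabilities each model assigns to the tokens the student produced.
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    tok = rollout_ids.unsqueeze(-1)
    student_tok_logp = student_logp.gather(-1, tok).squeeze(-1)  # [batch, seq_len]
    teacher_tok_logp = teacher_logp.gather(-1, tok).squeeze(-1)  # [batch, seq_len]

    # Dense per-token signal: how much more (or less) likely the teacher found
    # each token the student emitted. Unlike a sequence-level RKL penalty, this
    # tells the student *which* tokens to fix, not just that the rollout was bad.
    per_token_advantage = (teacher_tok_logp - student_tok_logp).detach()

    # Plug the dense signal into a standard policy-gradient loss, the way an
    # RLVR-style trainer would consume a per-token reward.
    pg_loss = -(per_token_advantage * student_tok_logp).mean()
    return pg_loss
```

The point of the sketch is the shape of the signal: one scalar per generated token, computed from teacher predictions on the student's own rollout, rather than a single sequence-level KL penalty.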
- The paper is an arXiv submission filed under Computer Science > Machine Learning (cs.LG).
Entities
Institutions
- arXiv