MAD-OPD: Multi-Agent Debate Breaks Teacher Ceiling in On-Policy Distillation
Researchers have introduced MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), a technique that lifts the capability ceiling imposed by a single teacher in on-policy distillation (OPD). Several teachers debate the student's on-policy state, and the resulting collective intelligence supplies token-level guidance, with each teacher's contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, the authors also present On-Policy Agentic Distillation (OPAD), which adds step-level sampling to keep training stable as errors accumulate over multiple steps. The paper is available on arXiv (2605.01347).
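The confidence-weighted, token-level guidance described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`mix_teachers`, `kl_divergence`), the linear normalization of confidences, and the choice of KL direction are all assumptions made for illustration, since the summary does not give the paper's exact formulas.

```python
import math

def mix_teachers(teacher_probs, confidences):
    """Blend teacher next-token distributions into one target distribution.

    teacher_probs: one probability vector per teacher, over the same vocabulary.
    confidences:   one non-negative post-debate confidence score per teacher.
    Assumption: weights are the confidences normalized to sum to 1.
    """
    total = sum(confidences)
    weights = [c / total for c in confidences]
    vocab_size = len(teacher_probs[0])
    return [
        sum(w * probs[v] for w, probs in zip(weights, teacher_probs))
        for v in range(vocab_size)
    ]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical numbers for a 3-token vocabulary at one position
# of the student's own (on-policy) trajectory.
teacher_probs = [
    [0.7, 0.2, 0.1],  # teacher A
    [0.1, 0.8, 0.1],  # teacher B
]
confidences = [0.9, 0.3]          # teacher A is more confident after debate
student_probs = [0.4, 0.4, 0.2]

target = mix_teachers(teacher_probs, confidences)
# Assumed loss direction: KL(student || target), evaluated per token.
token_loss = kl_divergence(student_probs, target)
```

With these numbers the more confident teacher receives three times the weight of the other, so the blended target leans toward its preferred token. In training, a loss of this kind would be computed at each token position of the student's sampled trajectory.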
Key facts
- MAD-OPD uses multi-agent debate to break the single-teacher ceiling in on-policy distillation.
- Teachers debate over the student's on-policy state to produce emergent collective intelligence.
- Each teacher's contribution is weighted by its post-debate confidence.
- OPAD adds step-level sampling to stabilize training for agentic tasks.
- The paper is available on arXiv with ID 2605.01347.
- On-policy distillation trains a student on its own trajectories under token-level teacher supervision.
- Existing OPD methods are capped by a single-teacher capability ceiling.
- OPD was largely unexplored in agentic tasks before this work.
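One possible reading of the step-level sampling mentioned for OPAD is sketched below. The summary does not state the sampling rule, so the uniform choice of steps, the helper name `sample_steps`, and the trajectory layout are assumptions for illustration only, not the paper's method.

```python
import random

def sample_steps(trajectory, k, rng):
    """Pick up to k step indices from a multi-step trajectory.

    Assumption: steps are drawn uniformly without replacement and
    returned in their original order.
    """
    k = min(k, len(trajectory))
    return sorted(rng.sample(range(len(trajectory)), k))

# Hypothetical on-policy agent trajectory: one entry per step.
trajectory = [{"step": i, "action": f"action_{i}"} for i in range(8)]

rng = random.Random(0)  # seeded for reproducibility
chosen = sample_steps(trajectory, k=3, rng=rng)

# Only the chosen steps would receive teacher supervision in this sketch.
supervised = [trajectory[i] for i in chosen]
```

Supervising a subset of steps rather than a full rollout is one way a method could limit the influence of late steps, where earlier mistakes have already compounded.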
Entities
Institutions
- arXiv