MOPD: Multi-Rollout On-Policy Distillation for LLMs
Researchers have introduced Multi-Rollout On-Policy Distillation (MOPD), a framework for post-training large language models with sparse verifier rewards. Rather than distilling each rollout in isolation and discarding the other attempts for the same prompt, as existing on-policy distillation methods do, MOPD conditions the teacher on the student's local rollout group, so the teacher signal draws on both successful and failed peer rollouts: successes reinforce valid reasoning patterns, while failures supply structured evidence about mistakes to avoid. The paper studies two peer-context constructions, positive peer imitation and contrastive success-failure. The work is available on arXiv under ID 2605.12652v1.
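As a rough illustration of the dense token-level supervision, here is a minimal PyTorch sketch of an on-policy distillation loss on a student rollout. The peer-conditioning lives entirely in how `teacher_logits` is produced (the teacher scores the rollout given the prompt plus a peer context); the reverse-KL direction and the function shape are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token distillation loss on a student-generated rollout.

    student_logits: [T, V] logits the student assigns to its own rollout.
    teacher_logits: [T, V] logits from a teacher conditioned on the same
        prompt plus a peer context built from sibling rollouts (the
        peer-conditioning step is where MOPD differs from plain on-policy
        distillation). Both score the same T sampled tokens, so the
        supervision is dense at the token level rather than a single
        sparse trajectory-level reward.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL(student || teacher), averaged over rollout tokens; the
    # choice of divergence is an assumption, not stated in the summary.
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl.mean()
```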
Key facts
- MOPD is a peer-conditioned distillation framework for LLMs
- It uses sparse verifier rewards that indicate trajectory success
- On-policy distillation provides denser token-level supervision
- Existing methods distill each rollout independently
- MOPD conditions the teacher on both successful and failed peer rollouts
- Successes provide positive evidence for valid reasoning patterns
- Failures provide structured negative evidence about mistakes to avoid
- Two peer-context constructions are studied: positive peer imitation and contrastive success-failure (see the sketch after this list)
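As a rough sketch of the two constructions, the hypothetical helper below (the `build_peer_context` name and the template wording are assumptions; the summary specifies only that the teacher is conditioned on successful and/or failed peer rollouts) assembles a teacher-side context from one prompt's rollout group and its binary verifier outcomes.

```python
def build_peer_context(rollouts: list[str], rewards: list[int],
                       mode: str = "contrastive") -> str:
    """Assemble a peer context from one prompt's rollout group.

    rollouts: generated solution strings for a single prompt.
    rewards:  parallel list of 0/1 verifier outcomes per rollout.
    mode:     "positive"    -> positive peer imitation (successes only)
              "contrastive" -> contrastive success-failure (both kinds)
    """
    successes = [r for r, s in zip(rollouts, rewards) if s == 1]
    failures = [r for r, s in zip(rollouts, rewards) if s == 0]
    parts = ["Successful peer attempt:\n" + r for r in successes]
    if mode == "contrastive":
        parts += ["Failed peer attempt (avoid its mistakes):\n" + r
                  for r in failures]
    return "\n\n".join(parts)
```

The teacher would then score the student's rollout conditioned on the original prompt plus this context, yielding the peer-aware teacher logits used in the loss sketch above.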