MOPD: Multi-Rollout On-Policy Distillation for LLMs
Researchers have introduced Multi-Rollout On-Policy Distillation (MOPD), a framework for post-training large language models with sparse verifier rewards. Rather than distilling each rollout in isolation and discarding the other attempts for the same prompt, as existing on-policy distillation methods do, MOPD conditions the teacher on the student's local rollout group, so the teacher signal draws on both successful and failed peer rollouts: successes reinforce valid reasoning patterns, while failures supply structured evidence about mistakes to avoid. The paper studies two peer-context constructions, positive peer imitation and contrastive success-failure. The work is available on arXiv under ID 2605.12652v1.
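As a rough illustration of the dense token-level supervision, here is a minimal PyTorch sketch of an on-policy distillation loss on a student rollout. The peer-conditioning lives entirely in how `teacher_logits` is produced (the teacher scores the rollout given the prompt plus a peer context); the reverse-KL direction and the function shape are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token distillation loss on a student-generated rollout.

    student_logits: [T, V] logits the student assigns to its own rollout.
    teacher_logits: [T, V] logits from a teacher conditioned on the same
        prompt plus a peer context built from sibling rollouts (the
        peer-conditioning step is where MOPD differs from plain on-policy
        distillation). Both score the same T sampled tokens, so the
        supervision is dense at the token level rather than a single
        sparse trajectory-level reward.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL(student || teacher), averaged over rollout tokens; the
    # choice of divergence is an assumption, not stated in the summary.
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl.mean()
```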
Key facts
- MOPD is a peer-conditioned distillation framework for LLMs
- It uses sparse verifier rewards that indicate trajectory success
- On-policy distillation provides denser token-level supervision
- Existing methods distill each rollout independently
- MOPD conditions the teacher on both successful and failed peer rollouts
- Successes provide positive evidence for valid reasoning patterns
- Failures provide structured negative evidence about mistakes to avoid
- Two peer-context constructions are studied: positive peer imitation and contrastive success-failure (see the sketch after this list)
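As a rough sketch of the two constructions, the hypothetical helper below (the `build_peer_context` name and the template wording are assumptions; the summary specifies only that the teacher is conditioned on successful and/or failed peer rollouts) assembles a teacher-side context from one prompt's rollout group and its binary verifier outcomes.

```python
def build_peer_context(rollouts: list[str], rewards: list[int],
                       mode: str = "contrastive") -> str:
    """Assemble a peer context from one prompt's rollout group.

    rollouts: generated solution strings for a single prompt.
    rewards:  parallel list of 0/1 verifier outcomes per rollout.
    mode:     "positive"    -> positive peer imitation (successes only)
              "contrastive" -> contrastive success-failure (both kinds)
    """
    successes = [r for r, s in zip(rollouts, rewards) if s == 1]
    failures = [r for r, s in zip(rollouts, rewards) if s == 0]
    parts = ["Successful peer attempt:\n" + r for r in successes]
    if mode == "contrastive":
        parts += ["Failed peer attempt (avoid its mistakes):\n" + r
                  for r in failures]
    return "\n\n".join(parts)
```

The teacher would then score the student's rollout conditioned on the original prompt plus this context, yielding the peer-aware teacher logits used in the loss sketch above.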