ARTFEED — Contemporary Art Intelligence

DMPO: A New RL Method to Prevent Mode Collapse in Diverse Reasoning

ai-technology · 2026-05-20

A new reinforcement learning method called DMPO (Distribution-Matching Policy Optimization) addresses mode collapse in on-policy algorithms such as GRPO. Mode collapse occurs when a model concentrates probability mass on a single solution and stops exploring alternatives. The authors show that this stems from the mode-seeking behavior of reverse KL minimization. DMPO instead approximates forward KL minimization: for each group of sampled trajectories, it constructs a target distribution whose probabilities are proportional to the trajectories' rewards, then aligns the policy's distribution with that target. This yields mode-covering behavior without sampling from the intractable global target, which enables sustained exploration. The method is validated on NP-hard combinatorial problems, where it maintains diversity in reasoning.
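The group-level construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`group_target`, `group_policy`, `forward_kl`) are hypothetical, rewards are assumed non-negative so they can be normalized directly, and the policy's distribution over the group is assumed to be a softmax of the trajectories' sequence log-probabilities. The summary does not specify these details.

```python
import numpy as np

def group_target(rewards):
    """Target over one sampled group, proportional to rewards.
    Assumption: rewards are non-negative; an all-zero group falls back to uniform."""
    r = np.asarray(rewards, dtype=float)
    if np.any(r < 0):
        raise ValueError("this sketch assumes non-negative rewards")
    total = r.sum()
    if total == 0.0:
        return np.full(r.shape, 1.0 / r.size)
    return r / total

def group_policy(logprobs):
    """Policy distribution restricted to the group.
    Assumption: a softmax over the trajectories' sequence log-probabilities."""
    z = np.asarray(logprobs, dtype=float)
    z = z - z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def forward_kl(q, p, eps=1e-12):
    """KL(q || p): the expectation is taken under the target q."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0                 # terms with q = 0 contribute nothing
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask] + eps))))

# Toy group of four trajectories (made-up numbers for illustration only).
rewards = [1.0, 1.0, 0.0, 2.0]
logprobs = [-2.0, -9.0, -3.0, -2.5]

q = group_target(rewards)
p = group_policy(logprobs)
loss = forward_kl(q, p)
print(q, p, loss)
```

In this toy group the second trajectory is rewarded but has very low policy probability, so it dominates the loss. That is the mode-covering pressure: the policy is penalized for neglecting any rewarded trajectory, not only for failing to sharpen the best one.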

Key facts

  • On-policy RL methods like GRPO suffer from mode collapse.
  • Mode collapse reduces solution diversity and stops exploration.
  • The cause is reverse KL minimization's mode-seeking behavior.
  • DMPO approximates forward KL minimization to prevent mode collapse.
  • DMPO constructs a group-level target distribution proportional to rewards.
  • DMPO aligns policy distribution to the target without global sampling.
  • DMPO enables sustained exploration throughout training.
  • Validation on NP-hard combinatorial problems shows that reasoning diversity is maintained.
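The contrast between mode-seeking and mode-covering behavior in the facts above can be seen in a toy experiment unrelated to any specific RL setup. The sketch below is an illustration only, with every name and number chosen for the example: it fits a single discretized Gaussian to a two-peaked target by brute-force grid search, once minimizing forward KL (target to model) and once minimizing reverse KL (model to target).

```python
import numpy as np

# Discrete support for all distributions in this toy example.
x = np.linspace(-10.0, 10.0, 401)

def gaussian_pmf(mu, sigma):
    """Gaussian shape evaluated on the grid and normalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

# Two-peaked target: equal mass around -3 and +3.
target = 0.5 * gaussian_pmf(-3.0, 0.5) + 0.5 * gaussian_pmf(3.0, 0.5)

def kl(a, b, tiny=1e-300):
    """KL(a || b) on the grid; clipping avoids log(0) from underflow."""
    a = np.clip(a, tiny, None)
    b = np.clip(b, tiny, None)
    return float(np.sum(a * (np.log(a) - np.log(b))))

def best_fit(direction):
    """Grid search over (mu, sigma) for the single Gaussian with lowest KL."""
    best = None
    for mu in np.linspace(-4.0, 4.0, 33):
        for sigma in np.linspace(0.3, 4.0, 38):
            model = gaussian_pmf(mu, sigma)
            if direction == "forward":
                d = kl(target, model)   # expectation under the target
            else:
                d = kl(model, target)   # expectation under the model
            if best is None or d < best[0]:
                best = (d, mu, sigma)
    return best[1], best[2]

mu_fwd, sigma_fwd = best_fit("forward")
mu_rev, sigma_rev = best_fit("reverse")
print("forward KL fit:", mu_fwd, sigma_fwd)
print("reverse KL fit:", mu_rev, sigma_rev)
```

On this setup the forward KL fit centers between the peaks with a wide spread, covering both, while the reverse KL fit settles narrowly on one peak and ignores the other. A single Gaussian cannot represent both peaks well, so the example shows only the direction of the bias, not how a policy trained with either objective would behave.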

Entities

Institutions

  • arXiv
