GCPO: Cooperative Policy Optimization Boosts LLM Reasoning Diversity

ai-technology · 2026-05-13

A new reinforcement learning method, Group Cooperative Policy Optimization (GCPO), targets exploration collapse in LLM reasoning by replacing winner-takes-all competition among rollouts with team-level credit assignment. GRPO (Group Relative Policy Optimization) tends to converge prematurely on a narrow set of reasoning patterns. GCPO instead rewards each rollout according to how much it adds to the team's coverage of valid solutions, so a rollout that merely repeats an already-found solution earns less than one that finds a new valid approach. This shifts training from rollout competition to team cooperation and promotes diverse reasoning strategies. The approach is detailed in arXiv paper 2605.11461.
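The contrast between per-rollout competition and coverage-based team credit can be illustrated with a small sketch. It is not taken from the paper, and the summary above does not give GCPO's actual reward formula. The function `coverage_credit`, the strategy labels, and the equal-split rule are all assumptions made for illustration. The `grpo_advantages` function follows the commonly described group-normalized form (reward minus group mean, divided by group standard deviation).

```python
from collections import Counter
from statistics import mean, pstdev


def grpo_advantages(rewards):
    """Group-normalized advantages in the commonly described GRPO style:
    (reward - group mean) / group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def coverage_credit(strategies, valid):
    """Hypothetical team-level credit (an assumption, not the paper's formula).

    Each distinct valid strategy adds one unit of coverage to the team.
    That unit is split equally among the rollouts that found it, so
    duplicated solutions earn less than unique ones.

    strategies[i]: hashable label for rollout i's approach (assumed given)
    valid[i]: True if rollout i reached a valid solution
    """
    counts = Counter(s for s, ok in zip(strategies, valid) if ok)
    return [1.0 / counts[s] if ok else 0.0 for s, ok in zip(strategies, valid)]


# Four rollouts for one prompt: three are correct, two share a strategy.
strategies = ["algebra", "algebra", "geometry", "algebra"]
valid = [True, True, True, False]
rewards = [1.0 if ok else 0.0 for ok in valid]

competitive = grpo_advantages(rewards)
cooperative = coverage_credit(strategies, valid)

for s, a, c in zip(strategies, competitive, cooperative):
    print(f"{s:10s} grpo_advantage={a:+.3f} coverage_credit={c:.2f}")
```

In this toy group, the normalized advantages treat all three correct rollouts identically, while the coverage split gives the lone "geometry" rollout twice the credit of each duplicated "algebra" rollout. The equal split is one simple way to assign credit for coverage, and the paper may define contribution differently.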

Key facts

GCPO stands for Group Cooperative Policy Optimization
GCPO addresses exploration collapse in LLM reasoning
GRPO suffers from premature convergence on narrow patterns
GCPO replaces winner-takes-all competition with team cooperation
GCPO uses team-level credit assignment
Rollouts are rewarded by contribution to valid solution coverage
The paper is on arXiv with ID 2605.11461

Sources

arXiv cs.AI — 2026-05-13