GCPO: Cooperative Policy Optimization Boosts LLM Reasoning Diversity
A new reinforcement learning method, Group Cooperative Policy Optimization (GCPO), addresses exploration collapse in LLM reasoning by replacing winner-takes-all competition with team-level credit assignment. Unlike Group Relative Policy Optimization (GRPO), which suffers from premature convergence on a narrow set of reasoning patterns, GCPO rewards each rollout according to its contribution to the team's coverage of valid solutions. This shifts the training paradigm from competition among rollouts to cooperation within a team, promoting diverse reasoning strategies. The approach is detailed in arXiv paper 2605.11461.
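To make the contrast concrete, here is a minimal Python sketch. It is not the paper's formula: this summary does not give GCPO's actual credit rule, so `coverage_credit` is a hypothetical stand-in that splits one unit of credit per distinct valid strategy among the rollouts that found it. The strategy labels and both function names are illustrative assumptions. `grpo_advantages` uses the commonly described group-normalized form (reward minus group mean, divided by group standard deviation).

```python
from collections import Counter
from statistics import mean, pstdev


def grpo_advantages(rewards):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def coverage_credit(strategies, correct):
    """Hypothetical team-level credit (an assumption, not the paper's rule).

    Each distinct valid strategy adds one unit of coverage to the team.
    That unit is shared equally by the correct rollouts that used it.
    Incorrect rollouts add no coverage and receive zero credit.
    """
    counts = Counter(s for s, ok in zip(strategies, correct) if ok)
    return [1.0 / counts[s] if ok else 0.0 for s, ok in zip(strategies, correct)]


# Five rollouts for one prompt: three share a strategy, one is unique, one fails.
strategies = ["algebra", "algebra", "algebra", "geometry", "guess"]
correct = [True, True, True, True, False]
rewards = [1.0 if ok else 0.0 for ok in correct]

print(grpo_advantages(rewards))
print(coverage_credit(strategies, correct))
```

In this toy group, the outcome-only advantage treats all four correct rollouts identically, so nothing favors the less common geometry solution. Under the assumed coverage rule, the geometry rollout receives a full unit of credit while each algebra rollout receives one third, which is the kind of signal that would encourage diverse strategies.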
Key facts
- GCPO stands for Group Cooperative Policy Optimization
- GCPO addresses exploration collapse in LLM reasoning
- GRPO suffers from premature convergence on narrow patterns
- GCPO replaces winner-takes-all competition with team cooperation
- GCPO uses team-level credit assignment
- Rollouts are rewarded by contribution to valid solution coverage
- The paper is on arXiv with ID 2605.11461
- The method shifts from rollout competition to team cooperation