ConSPO: Contrastive Framework Improves GRPO for LLM Reasoning
A new paper on arXiv (2605.12969) revisits Reinforcement Learning with Verifiable Rewards (RLVR) from a contrastive perspective, focusing on GRPO, a key algorithm for improving LLM reasoning. The authors show that GRPO is equivalent to optimizing a weighted difference between positive and negative rollout scores, where each score is built from clipped token-level importance sampling ratios. From this view they identify two limitations: likelihood-misaligned scoring and score-insensitive credit assignment. To address them, they propose ConSPO (Contrastive Sequence-level Policy Optimization), a framework that better aligns optimization with generation likelihoods and accounts for the relative score gaps between positive and negative rollouts.
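For concreteness, here is a minimal sketch of the standard GRPO surrogate the paper starts from: group-normalized advantages broadcast to every token of a rollout, combined with PPO-style clipped token-level importance sampling ratios. The tensor shapes, the omitted padding mask, and the omitted KL penalty are simplifications for illustration, not the paper's exact formulation.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps=0.2):
    """Sketch of the GRPO clipped surrogate for one prompt group.

    logp_new, logp_old: (G, T) per-token log-probs of G rollouts under the
    current and behavior policies; rewards: (G,) verifiable scalar rewards.
    Padding masks and the KL penalty are omitted for brevity.
    """
    # Group-normalized advantage, shared by every token of a rollout.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
    adv = adv.unsqueeze(1)                                     # (G, 1)

    # Token-level importance sampling ratios, then PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)                     # (G, T)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)

    # Maximize the clipped surrogate -> minimize its negation.
    return -torch.min(ratio * adv, clipped * adv).mean()
```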
Key facts
- arXiv paper 2605.12969 revisits RLVR from a contrastive perspective
- GRPO is reformulated as a weighted positive-negative score difference (see the numerical sketch after this list)
- GRPO optimizes clipped token-level importance sampling ratios
- Two limitations identified: likelihood-misaligned scoring and score-insensitive credit assignment
- ConSPO proposed to address these limitations
- ConSPO stands for Contrastive Sequence-level Policy Optimization
- The paper appeared as a cross-listed announcement on arXiv
- The work aims to improve LLM reasoning capabilities
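To see the contrastive reading in miniature: with binary verifiable rewards, group normalization assigns one shared positive weight to every correct rollout and one shared negative weight to every incorrect one, so the surrogate above reduces to a weighted difference between positive and negative scores. A toy check (the group of five rollouts is hypothetical):

```python
import torch

# Hypothetical group of 5 rollouts with binary verifiable rewards.
rewards = torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(adv)  # tensor([ 1.0954,  1.0954, -0.7303, -0.7303, -0.7303])
# All positives share weight +1.10 and all negatives share -0.73, so the
# objective is a weighted positive-minus-negative score difference.
```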