ConSPO: Contrastive Framework Improves GRPO for LLM Reasoning
A new paper on arXiv (2605.12969) revisits Reinforcement Learning with Verifiable Rewards (RLVR) from a contrastive perspective, focusing on GRPO, a key algorithm for improving LLM reasoning. The authors show that GRPO is equivalent to optimizing a weighted difference between positive and negative rollout scores, where each score is built from clipped token-level importance sampling ratios. From this view they identify two limitations: likelihood-misaligned scoring and score-insensitive credit assignment. To address them, they propose ConSPO (Contrastive Sequence-level Policy Optimization), a framework that better aligns optimization with generation likelihoods and accounts for the relative score gaps between positive and negative rollouts.
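For concreteness, here is a minimal sketch of the standard GRPO surrogate the paper starts from: group-normalized advantages broadcast to every token of a rollout, combined with PPO-style clipped token-level importance sampling ratios. The tensor shapes, the omitted padding mask, and the omitted KL penalty are simplifications for illustration, not the paper's exact formulation.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps=0.2):
    """Sketch of the GRPO clipped surrogate for one prompt group.

    logp_new, logp_old: (G, T) per-token log-probs of G rollouts under the
    current and behavior policies; rewards: (G,) verifiable scalar rewards.
    Padding masks and the KL penalty are omitted for brevity.
    """
    # Group-normalized advantage, shared by every token of a rollout.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
    adv = adv.unsqueeze(1)                                     # (G, 1)

    # Token-level importance sampling ratios, then PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)                     # (G, T)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)

    # Maximize the clipped surrogate -> minimize its negation.
    return -torch.min(ratio * adv, clipped * adv).mean()
```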
Key facts
- arXiv paper 2605.12969 revisits RLVR from a contrastive perspective
- GRPO is reformulated as a weighted positive-negative score difference (see the numerical sketch after this list)
- GRPO optimizes clipped token-level importance sampling ratios
- Two limitations identified: likelihood-misaligned scoring and score-insensitive credit assignment
- ConSPO proposed to address these limitations
- ConSPO stands for Contrastive Sequence-level Policy Optimization
- The paper appeared as a cross-listed announcement on arXiv
- The work aims to improve LLM reasoning capabilities
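To see the contrastive reading in miniature: with binary verifiable rewards, group normalization assigns one shared positive weight to every correct rollout and one shared negative weight to every incorrect one, so the surrogate above reduces to a weighted difference between positive and negative scores. A toy check (the group of five rollouts is hypothetical):

```python
import torch

# Hypothetical group of 5 rollouts with binary verifiable rewards.
rewards = torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(adv)  # tensor([ 1.0954,  1.0954, -0.7303, -0.7303, -0.7303])
# All positives share weight +1.10 and all negatives share -0.73, so the
# objective is a weighted positive-minus-negative score difference.
```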