ARTFEED — Contemporary Art Intelligence

ConSPO: Contrastive Framework Improves GRPO for LLM Reasoning

ai-technology · 2026-05-14

A new paper on arXiv (2605.12969) revisits Reinforcement Learning with Verifiable Rewards (RLVR) from a contrastive perspective, focusing on GRPO (Group Relative Policy Optimization), a key algorithm for improving LLM reasoning. The authors show that GRPO, which optimizes clipped token-level importance-sampling ratios, is equivalent to a weighted difference between positive and negative rollout scores. From this view they identify two limitations: likelihood-misaligned scoring and score-insensitive credit assignment. To address them, they propose ConSPO (Contrastive Sequence-level Policy Optimization), a framework that better aligns optimization with generation likelihoods and accounts for the relative score gaps between positive and negative rollouts.
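For readers unfamiliar with GRPO, the two ingredients mentioned above, group-relative advantages and a clipped token-level importance-sampling objective, can be sketched in a few lines. This is a minimal illustration of the standard GRPO recipe, not code from the paper; function names and the clipping constant are assumptions.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage: normalize each rollout's verifiable
    reward within its sampled group (mean-centered, std-scaled)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_token_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped token-level objective for one rollout:
    mean over tokens of min(rho * A, clip(rho, 1-eps, 1+eps) * A),
    where rho is the per-token importance-sampling ratio."""
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = rho * advantage
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()
```

With binary verifiable rewards, every correct rollout in a group gets the same positive advantage and every incorrect one the same negative advantage, which is what makes the contrastive reading in the paper possible.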

Key facts

  • arXiv paper 2605.12969 revisits RLVR from a contrastive perspective
  • GRPO is reformulated as a weighted positive-negative score difference
  • GRPO optimizes clipped token-level importance sampling ratios
  • Two limitations identified: likelihood-misaligned scoring and score-insensitive credit assignment
  • ConSPO proposed to address these limitations
  • ConSPO stands for Contrastive Sequence-level Policy Optimization
  • The paper is a cross-listed announcement on arXiv
  • The work aims to improve LLM reasoning capabilities
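The reformulation in the second bullet, GRPO as a weighted positive-negative score difference, can be verified numerically in the binary-reward case. The sketch below is illustrative (the function name and score values are assumptions): it checks that the advantage-weighted sum of per-rollout scores equals a weighted difference between the summed positive and negative scores.

```python
import numpy as np

def contrastive_decomposition(scores, rewards):
    """Illustrative check: with binary rewards, group-normalized
    advantages give every correct rollout one shared positive weight
    and every incorrect rollout one shared negative weight, so the
    advantage-weighted sum of per-rollout scores reduces to a weighted
    positive-minus-negative score difference."""
    r = np.asarray(rewards, dtype=float)
    s = np.asarray(scores, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)   # group-relative advantages
    weighted_sum = float((adv * s).sum())     # GRPO-style objective value
    w_pos = float(adv[r == 1][0])             # shared weight, correct rollouts
    w_neg = float(-adv[r == 0][0])            # shared weight, incorrect rollouts
    difference = w_pos * s[r == 1].sum() - w_neg * s[r == 0].sum()
    return weighted_sum, float(difference)
```

The two returned values agree, which is the sense in which the paper reads GRPO contrastively: one scalar weight on the positives, one on the negatives, and nothing in between.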

Entities

Institutions

  • arXiv

Sources