ARTFEED — Contemporary Art Intelligence

DCPO Framework Decouples Reasoning and Calibration in RLVR

ai-technology · 2026-05-01

A new framework called DCPO (Decoupled Calibration and Policy Optimization) addresses calibration degeneration in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models. The researchers identify a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error, a conflict that drives models to become over-confident in incorrect answers. DCPO systematically separates the reasoning and calibration objectives, preserving accuracy on par with GRPO (Group Relative Policy Optimization) while achieving superior calibration and substantially mitigating over-confidence. The study pairs theoretical analysis with a practical solution for improving LLM reliability.
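To see why a single coupled update struggles, write the accuracy objective as \mathcal{L}_{\mathrm{acc}} and the calibration objective as \mathcal{L}_{\mathrm{cal}} (illustrative notation, not necessarily the paper's). A joint update follows the gradient of a weighted sum, and the conflict is the regime where the two gradients oppose each other:

    \nabla_\theta \mathcal{L}_{\mathrm{joint}}
        = \nabla_\theta \mathcal{L}_{\mathrm{acc}}
        + \lambda\, \nabla_\theta \mathcal{L}_{\mathrm{cal}},
    \qquad
    \big\langle \nabla_\theta \mathcal{L}_{\mathrm{acc}},\;
                \nabla_\theta \mathcal{L}_{\mathrm{cal}} \big\rangle < 0

When the inner product is negative, a first-order step along the joint direction degrades one of the two objectives depending on \lambda: a small \lambda worsens calibration, a large \lambda worsens accuracy. This is why the framework decouples the objectives rather than reweighting them.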

Key facts

  • RLVR enhances LLM reasoning but causes calibration degeneration
  • Models become over-confident in incorrect answers
  • Gradient conflict exists between accuracy and calibration objectives
  • DCPO decouples the reasoning and calibration objectives (a minimal code sketch follows this list)
  • DCPO preserves accuracy on par with GRPO
  • DCPO achieves the best calibration performance among the methods compared
  • DCPO substantially mitigates over-confidence issue
  • The study provides theoretical analysis and a practical solution
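The decoupling idea can be illustrated with a minimal, self-contained sketch. Everything below is an assumption for illustration, not the paper's implementation: the toy backbone, the policy_head / conf_head split, the loss choices, and the separate optimizers are all hypothetical. The point it demonstrates is the structural one: the reasoning objective updates the policy parameters, while the calibration objective trains only a confidence head on detached features, so calibration gradients never flow back into the reasoning parameters.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)

    # Hypothetical toy model: a shared backbone, an answer-scoring head,
    # and a scalar confidence head. Stand-ins, not the paper's architecture.
    backbone = nn.Linear(16, 32)
    policy_head = nn.Linear(32, 4)   # scores over 4 candidate answers
    conf_head = nn.Linear(32, 1)     # scalar confidence estimate

    # Decoupling step 1: disjoint parameter groups with separate optimizers.
    policy_params = list(backbone.parameters()) + list(policy_head.parameters())
    policy_opt = torch.optim.SGD(policy_params, lr=1e-2)
    calib_opt = torch.optim.SGD(conf_head.parameters(), lr=1e-2)

    x = torch.randn(8, 16)                # dummy prompts
    answers = torch.randint(0, 4, (8,))   # dummy verifiable labels

    # Reasoning update: cross-entropy on verifiable answers, used here
    # as a simple supervised stand-in for an RLVR policy objective.
    logits = policy_head(backbone(x))
    policy_loss = F.cross_entropy(logits, answers)
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Calibration update: fit confidence to actual correctness. Features
    # are computed under no_grad, so the calibration loss has no gradient
    # path into the backbone or policy head (decoupling step 2).
    with torch.no_grad():
        h = backbone(x)
        correct = (policy_head(h).argmax(dim=-1) == answers).float()
    conf = torch.sigmoid(conf_head(h)).squeeze(-1)
    calib_loss = F.binary_cross_entropy(conf, correct)
    calib_opt.zero_grad()
    calib_loss.backward()   # reaches conf_head only
    calib_opt.step()

    print(f"policy_loss={policy_loss.item():.3f} calib_loss={calib_loss.item():.3f}")

The torch.no_grad() boundary plus the separate optimizers is one simple way to realize "decoupled" updates; the paper's actual mechanism for separating the objectives may differ.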
