DCPO Framework Decouples Reasoning and Calibration in RLVR
A new framework called DCPO (Decoupled Calibration and Policy Optimization) addresses calibration degeneration in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models. The researchers identify a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error, which drives models to become over-confident in incorrect answers. DCPO systematically separates the reasoning and calibration objectives, preserving accuracy on par with GRPO while achieving superior calibration and substantially mitigating over-confidence. The study offers both theoretical analysis and a practical solution for improving LLM reliability.
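The gradient conflict can be made concrete with a toy measurement: take the gradients of an accuracy-style policy objective and a calibration objective with respect to shared parameters and check their cosine similarity. The following is a minimal illustrative sketch, not the paper's setup; the REINFORCE-style surrogate, the Brier calibration loss, and all variable names are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# A toy "policy": confidence in the correct answer for 8 prompts,
# produced from a shared parameter vector theta.
theta = torch.randn(4, requires_grad=True)
features = torch.randn(8, 4)
logits = features @ theta
p_correct = torch.sigmoid(logits)                 # model confidence
is_correct = torch.randint(0, 2, (8,)).float()    # verifiable 0/1 rewards

# Accuracy-style objective: push up log-prob of rewarded answers
# (a REINFORCE-flavored surrogate, not GRPO itself).
policy_loss = -(is_correct * torch.log(p_correct + 1e-8)).mean()

# Calibration objective: Brier score between confidence and correctness.
calib_loss = ((p_correct - is_correct) ** 2).mean()

g_policy = torch.autograd.grad(policy_loss, theta, retain_graph=True)[0]
g_calib = torch.autograd.grad(calib_loss, theta)[0]

# A negative cosine similarity means the two objectives pull the shared
# parameters in conflicting directions -- the kind of accuracy-vs-calibration
# gradient conflict the paper identifies.
cos = F.cosine_similarity(g_policy.unsqueeze(0), g_calib.unsqueeze(0)).item()
print(f"gradient cosine similarity: {cos:.3f}")
```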
Key facts
- RLVR enhances LLM reasoning but causes calibration degeneration
- Models become over-confident in incorrect answers
- A gradient conflict exists between the accuracy and calibration objectives
- DCPO decouples the reasoning and calibration objectives (see the sketch after this list)
- DCPO preserves accuracy on par with GRPO
- DCPO achieves the best calibration performance
- DCPO substantially mitigates the over-confidence issue
- The study provides theoretical insights and a practical solution
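To show what decoupling the two objectives could look like, here is a minimal sketch under an assumed architecture: a shared encoder, a policy head, and a separate confidence head (none of these module names come from the paper). The reward gradient updates the reasoning path, while the calibration loss trains only the confidence head on detached features, so the two gradients never flow through the same parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative architecture (an assumption, not the paper's):
encoder = nn.Linear(8, 16)       # shared representation
policy_head = nn.Linear(16, 2)   # toy policy over 2 candidate answers
conf_head = nn.Linear(16, 1)     # separate calibration head

opt_reason = torch.optim.SGD(
    list(encoder.parameters()) + list(policy_head.parameters()), lr=1e-2
)
opt_conf = torch.optim.SGD(conf_head.parameters(), lr=1e-2)

x = torch.randn(32, 8)                          # toy prompt features
rewards = torch.randint(0, 2, (32,)).float()    # verifiable 0/1 rewards
chosen = torch.randint(0, 2, (32,))             # sampled answer indices

# 1) Reasoning update: REINFORCE-style surrogate through encoder + policy head.
h = encoder(x)
logp = torch.log_softmax(policy_head(h), dim=-1)
logp_chosen = logp.gather(1, chosen.unsqueeze(1)).squeeze(1)
policy_loss = -(rewards * logp_chosen).mean()
opt_reason.zero_grad()
policy_loss.backward()
opt_reason.step()

# 2) Calibration update: Brier loss on the confidence head, with the
#    encoder output detached so no calibration gradient reaches the policy.
h_detached = encoder(x).detach()
conf = torch.sigmoid(conf_head(h_detached)).squeeze(1)
calib_loss = ((conf - rewards) ** 2).mean()
opt_conf.zero_grad()
calib_loss.backward()
opt_conf.step()
```

Because the calibration gradient stops at the detached features, the conflicting pull measured in the first sketch never reaches the reasoning parameters; this is one plausible reading of "decoupling", not a reconstruction of DCPO's actual algorithm.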