DCPO Framework Decouples Reasoning and Calibration in RLVR
A new framework called DCPO (Decoupled Calibration and Policy Optimization) addresses calibration degeneration in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models. The researchers identify a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error, which drives models to become over-confident in incorrect answers. DCPO systematically separates the reasoning and calibration objectives, preserving accuracy on par with GRPO while achieving superior calibration and substantially mitigating over-confidence. The study offers both theoretical analysis and a practical solution for improving LLM reliability.
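The gradient conflict can be made concrete with a toy measurement: take the gradients of an accuracy-style policy objective and a calibration objective with respect to shared parameters and check their cosine similarity. The following is a minimal illustrative sketch, not the paper's setup; the REINFORCE-style surrogate, the Brier calibration loss, and all variable names are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# A toy "policy": confidence in the correct answer for 8 prompts,
# produced from a shared parameter vector theta.
theta = torch.randn(4, requires_grad=True)
features = torch.randn(8, 4)
logits = features @ theta
p_correct = torch.sigmoid(logits)                 # model confidence
is_correct = torch.randint(0, 2, (8,)).float()    # verifiable 0/1 rewards

# Accuracy-style objective: push up log-prob of rewarded answers
# (a REINFORCE-flavored surrogate, not GRPO itself).
policy_loss = -(is_correct * torch.log(p_correct + 1e-8)).mean()

# Calibration objective: Brier score between confidence and correctness.
calib_loss = ((p_correct - is_correct) ** 2).mean()

g_policy = torch.autograd.grad(policy_loss, theta, retain_graph=True)[0]
g_calib = torch.autograd.grad(calib_loss, theta)[0]

# A negative cosine similarity means the two objectives pull the shared
# parameters in conflicting directions -- the kind of accuracy-vs-calibration
# gradient conflict the paper identifies.
cos = F.cosine_similarity(g_policy.unsqueeze(0), g_calib.unsqueeze(0)).item()
print(f"gradient cosine similarity: {cos:.3f}")
```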
Key facts
- RLVR enhances LLM reasoning but causes calibration degeneration
- Models become over-confident in incorrect answers
- A gradient conflict exists between the accuracy and calibration objectives
- DCPO decouples the reasoning and calibration objectives (see the sketch after this list)
- DCPO preserves accuracy on par with GRPO
- DCPO achieves the best calibration performance
- DCPO substantially mitigates the over-confidence issue
- The study provides theoretical insights and a practical solution
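To show what decoupling the two objectives could look like, here is a minimal sketch under an assumed architecture: a shared encoder, a policy head, and a separate confidence head (none of these module names come from the paper). The reward gradient updates the reasoning path, while the calibration loss trains only the confidence head on detached features, so the two gradients never flow through the same parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative architecture (an assumption, not the paper's):
encoder = nn.Linear(8, 16)       # shared representation
policy_head = nn.Linear(16, 2)   # toy policy over 2 candidate answers
conf_head = nn.Linear(16, 1)     # separate calibration head

opt_reason = torch.optim.SGD(
    list(encoder.parameters()) + list(policy_head.parameters()), lr=1e-2
)
opt_conf = torch.optim.SGD(conf_head.parameters(), lr=1e-2)

x = torch.randn(32, 8)                          # toy prompt features
rewards = torch.randint(0, 2, (32,)).float()    # verifiable 0/1 rewards
chosen = torch.randint(0, 2, (32,))             # sampled answer indices

# 1) Reasoning update: REINFORCE-style surrogate through encoder + policy head.
h = encoder(x)
logp = torch.log_softmax(policy_head(h), dim=-1)
logp_chosen = logp.gather(1, chosen.unsqueeze(1)).squeeze(1)
policy_loss = -(rewards * logp_chosen).mean()
opt_reason.zero_grad()
policy_loss.backward()
opt_reason.step()

# 2) Calibration update: Brier loss on the confidence head, with the
#    encoder output detached so no calibration gradient reaches the policy.
h_detached = encoder(x).detach()
conf = torch.sigmoid(conf_head(h_detached)).squeeze(1)
calib_loss = ((conf - rewards) ** 2).mean()
opt_conf.zero_grad()
calib_loss.backward()
opt_conf.step()
```

Because the calibration gradient stops at the detached features, the conflicting pull measured in the first sketch never reaches the reasoning parameters; this is one plausible reading of "decoupling", not a reconstruction of DCPO's actual algorithm.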