UCPO Framework Tackles Overconfidence in LLMs
A new framework called Uncertainty-Aware Policy Optimization (UCPO) targets reward hacking and overconfidence in large language models (LLMs) trained with reinforcement learning. The paper, posted to arXiv (2601.22648), identifies an Advantage Bias in existing RL paradigms such as GRPO and attributes it to binary decision spaces and static uncertainty rewards. UCPO introduces Ternary Advantage Decoupling, which separates and normalizes deterministic and uncertain rollouts in order to remove that bias. It also includes a Dynamic Uncertainty Reward Adjustment mechanism that adapts uncertainty weights in real time, based on how the model evolves and on the difficulty of each instance. The goal is to give LLMs an inherent ability to express uncertainty, reducing overconfident errors in high-stakes applications.
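The coupling behind the Advantage Bias can be illustrated with a small sketch. This is not the paper's implementation: the reward values, the outcome labels, and the functions `grpo_advantages` and `decoupled_advantages` are all hypothetical, and the way uncertain rollouts are scored in the decoupled variant is left as a caller-supplied placeholder because the summary does not describe it. The sketch only shows that when all rollouts share one normalization pool, adding "uncertain" rollouts with a fixed reward shifts the advantages of the deterministic ones, whereas normalizing the deterministic rollouts on their own does not.

```python
from statistics import mean, pstdev

# Hypothetical reward values for illustration only; not taken from the paper.
REWARD = {"correct": 1.0, "uncertain": 0.3, "incorrect": 0.0}


def group_normalize(values, eps=1e-6):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    if not values:
        return []
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def grpo_advantages(outcomes):
    """Single-pool baseline: every rollout shares one mean and one std."""
    return group_normalize([REWARD[o] for o in outcomes])


def decoupled_advantages(outcomes, uncertain_advantage=0.0):
    """Assumed decoupling: normalize correct/incorrect rollouts among themselves.

    `uncertain_advantage` is a placeholder; how UCPO actually scores
    uncertain rollouts is not specified in this summary.
    """
    det_idx = [i for i, o in enumerate(outcomes) if o != "uncertain"]
    det_adv = group_normalize([REWARD[outcomes[i]] for i in det_idx])
    advantages = [uncertain_advantage] * len(outcomes)
    for i, a in zip(det_idx, det_adv):
        advantages[i] = a
    return advantages


if __name__ == "__main__":
    hard_prompt = ["correct", "incorrect", "incorrect", "incorrect"]
    with_abstentions = hard_prompt + ["uncertain", "uncertain"]

    print("pooled, no abstentions:  ", grpo_advantages(hard_prompt))
    print("pooled, with abstentions:", grpo_advantages(with_abstentions))
    print("decoupled, no abstentions:  ", decoupled_advantages(hard_prompt))
    print("decoupled, with abstentions:", decoupled_advantages(with_abstentions))
```

In the pooled case the uncertain rollouts receive a positive advantage simply because the group mean is low, and the advantage of the correct rollout changes once they are added. In the decoupled case the correct and incorrect rollouts keep the same advantages regardless of how many uncertain rollouts are in the group.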
Key facts
- UCPO stands for Uncertainty-Aware Policy Optimization
- The paper is on arXiv with ID 2601.22648
- It addresses Advantage Bias in RL paradigms like GRPO
- Ternary Advantage Decoupling is a key component
- Dynamic Uncertainty Reward Adjustment adapts uncertainty weights in real time
- The framework aims to reduce overconfident errors in LLMs
- High-stakes applications are the target use case
- The paper was announced as a replacement (v2)
Entities
Institutions
- arXiv