UCPO Framework Tackles Overconfidence in LLMs

ai-technology · 2026-05-27

A new framework, Uncertainty-Aware Policy Optimization (UCPO), targets reward hacking and overconfidence in large language models (LLMs) trained with reinforcement learning (RL). The paper, posted on arXiv (2601.22648), identifies an Advantage Bias in existing RL paradigms such as GRPO and attributes it to binary decision spaces and static uncertainty rewards. UCPO introduces Ternary Advantage Decoupling, which separates and normalizes deterministic and uncertain rollouts to eliminate that bias. It also adds a Dynamic Uncertainty Reward Adjustment mechanism that adapts uncertainty weights in real time according to model evolution and instance difficulty. The goal is to give LLMs an inherent ability to express uncertainty, reducing overconfident errors in high-stakes applications.
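To make the normalization idea concrete, here is a minimal Python sketch, not the paper's implementation. It assumes each rollout is labeled "correct", "incorrect", or "uncertain" and rewarded 1.0, 0.0, or a static 0.5 respectively; those labels, reward values, and the function names are hypothetical. The first function is the usual group-normalized advantage associated with GRPO; the second standardizes deterministic and uncertain rollouts separately, which is one simple reading of "decoupling" and may differ from the formulation in the paper.

```python
from statistics import mean, pstdev

EPS = 1e-8  # avoids division by zero when every reward in a subset is equal


def standardize(values):
    """Zero-mean, unit-variance scaling using the population standard deviation."""
    if not values:
        return []
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def joint_advantages(rewards):
    """GRPO-style baseline: one normalization over every rollout in the group."""
    return standardize(rewards)


def decoupled_advantages(labels, rewards):
    """Hypothetical decoupling: standardize the deterministic rollouts
    (correct/incorrect) and the uncertain rollouts as two separate subsets."""
    advantages = [0.0] * len(rewards)
    for subset in ({"correct", "incorrect"}, {"uncertain"}):
        idx = [i for i, lab in enumerate(labels) if lab in subset]
        for i, adv in zip(idx, standardize([rewards[i] for i in idx])):
            advantages[i] = adv
    return advantages


# One group of five rollouts for a single prompt (illustrative values only).
labels = ["correct", "incorrect", "incorrect", "uncertain", "uncertain"]
rewards = [1.0, 0.0, 0.0, 0.5, 0.5]  # 0.5 is an assumed static uncertainty reward

joint = joint_advantages(rewards)
split = decoupled_advantages(labels, rewards)

# Adding another uncertain rollout shifts the joint advantages of the
# deterministic rollouts, but leaves their decoupled advantages unchanged.
joint_more = joint_advantages(rewards + [0.5])
split_more = decoupled_advantages(labels + ["uncertain"], rewards + [0.5])
```

In this simplified version, uncertain rollouts that all share the same static reward end up with zero advantage after their separate normalization. The sketch does not model the Dynamic Uncertainty Reward Adjustment described in the article.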

Key facts

UCPO stands for Uncertainty-Aware Policy Optimization
The paper is on arXiv with ID 2601.22648
It addresses Advantage Bias in RL paradigms like GRPO
Ternary Advantage Decoupling is a key component
Dynamic Uncertainty Reward Adjustment adapts uncertainty weights in real time
The framework aims to reduce overconfident errors in LLMs
High-stakes applications are the target use case
The paper was announced as a replacement (v2)
