RLCR: Training Language Models to Reason About Uncertainty

ai-technology · 2026-05-18

A new method called RLCR (Reinforcement Learning with Calibration Rewards) trains language models to generate both a prediction and a numerical confidence estimate after reasoning, optimizing a reward function that jointly improves accuracy and calibrated confidence estimation. The binary correctness rewards commonly used in RL for reasoning do not penalize guessing, so they tend to degrade calibration and increase hallucination rates. RLCR addresses this by augmenting the binary correctness reward with a Brier score, a scoring rule for confidence estimates that penalizes miscalibrated confidence rather than low confidence as such.
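The shape of such a reward can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the calibration term is a Brier score (the squared error between the stated confidence and the 0/1 correctness outcome) subtracted from the binary reward. The function names `rlcr_reward` and `expected_reward` are hypothetical.

```python
def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Binary correctness reward minus a Brier penalty on the stated confidence.

    Assumed form: R = c - (q - c)^2, where c is 1 for a correct answer and
    0 otherwise, and q is the model's stated confidence in [0, 1].
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    correct = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - correct) ** 2
    return correct - brier_penalty


def expected_reward(accuracy: float, confidence: float) -> float:
    """Expected reward when the answer is correct with probability `accuracy`
    and the model always states the same `confidence`."""
    return (accuracy * rlcr_reward(True, confidence)
            + (1.0 - accuracy) * rlcr_reward(False, confidence))


if __name__ == "__main__":
    # A confidently wrong answer is penalized more than an unsure wrong one.
    print(rlcr_reward(False, 0.9), rlcr_reward(False, 0.1))

    # Search a grid of confidences for a model that is right 70% of the time.
    grid = [i / 100 for i in range(101)]
    best = max(grid, key=lambda q: expected_reward(0.7, q))
    print(best)
```

Under this assumed form, the expected reward is highest when the stated confidence equals the true probability of being correct, which is the sense in which a Brier-style term rewards calibration rather than high or low confidence.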

Key facts

arXiv:2507.16806v2
RLCR stands for Reinforcement Learning with Calibration Rewards
Binary reward functions degrade calibration and increase hallucination rates
RLCR jointly improves accuracy and calibrated confidence estimation
LMs generate both predictions and numerical confidence estimates after reasoning
The reward function augments a binary correctness reward with a Brier score, which penalizes miscalibrated confidence estimates
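Calibration, as used in the facts above, is often quantified with expected calibration error (ECE): predictions are grouped by stated confidence, and the gap between average confidence and actual accuracy is averaged across groups. The sketch below is a generic illustration of that metric, not code from the paper; the function name and the choice of 10 equal-width bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and accuracy per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    Equal-width binning with n_bins bins is an illustrative assumption.
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for q, c in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        index = min(int(q * n_bins), n_bins - 1)
        bins[index].append((q, 1.0 if c else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(q for q, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        ece += (len(members) / total) * abs(avg_confidence - accuracy)
    return ece
```

A model that states 0.9 confidence and is right nine times out of ten scores close to zero, while one that states 1.0 confidence and is right one time in four scores 0.75.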

Entities

—

Sources

arXiv cs.AI — 2026-05-18