RLCR: Training Language Models to Reason About Uncertainty
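RLCR (Reinforcement Learning with Calibration Rewards) is a method for training language models to generate both a prediction and a numerical confidence estimate after reasoning, using a reward function that jointly improves accuracy and calibrated confidence estimation. Standard binary correctness rewards in RL for reasoning can degrade calibration and increase hallucination rates, since they reward a correct answer equally whether the model was confident or guessing. RLCR addresses this by augmenting the binary reward with a Brier score term, which penalizes confidence estimates that do not match correctness (confident wrong answers and underconfident right ones) rather than low confidence as such.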
Key facts
- arXiv:2507.16806v2
- RLCR stands for Reinforcement Learning with Calibration Rewards
- Binary reward functions degrade calibration and increase hallucination rates
- RLCR jointly improves accuracy and calibrated confidence estimation
- LMs generate both predictions and numerical confidence estimates after reasoning
- The reward function augments a binary correctness reward with a Brier score term that penalizes miscalibrated confidence, not low confidence as such
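
To make the reward concrete, here is a minimal sketch, assuming the calibration term is a Brier score penalty subtracted from the binary correctness reward. The function name `rlcr_reward` and its signature are hypothetical and chosen only for illustration; the paper's training code may differ.

```python
def rlcr_reward(correct: bool, confidence: float) -> float:
    """Illustrative reward: binary correctness minus a Brier-score penalty.

    Assumption: reward = y - (q - y)^2, where y is 1 for a correct answer
    and 0 otherwise, and q is the model's stated confidence in [0, 1].
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    # Brier penalty: squared gap between stated confidence and the outcome.
    brier_penalty = (confidence - y) ** 2
    return y - brier_penalty


if __name__ == "__main__":
    # Compare confident and hedged answers for both outcomes.
    for correct, q in [(True, 0.9), (True, 0.5), (False, 0.5), (False, 0.9)]:
        print(correct, q, round(rlcr_reward(correct, q), 2))
```

Under this sketch, a wrong answer stated with confidence 0.9 scores lower than a wrong answer stated with confidence 0.5, while a correct answer scores highest when the stated confidence is high. Low confidence is only costly when the answer turns out to be correct.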
Entities
—