ARTFEED — Contemporary Art Intelligence

Uncertainty-Aware Reward Framework to Prevent RL Reward Hacking

ai-technology · 2026-04-30

A reinforcement learning framework tackles reward hacking by modeling two sources of uncertainty: epistemic uncertainty in value estimation and uncertainty over human preferences. Epistemic uncertainty is captured via disagreement within an ensemble of value predictors, while preference uncertainty is derived from variability in human reward annotations. A confidence-adjusted Reliability Filter then adaptively modulates action selection, balancing exploitation against caution. Empirical results across multiple discrete environments show reduced over-optimization and fewer alignment failures.
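The two uncertainty estimates can be illustrated with a minimal sketch. The paper does not specify its estimators, so the following assumes the simplest common choice: standard deviation across an ensemble of value predictions for epistemic uncertainty, and standard deviation across human reward labels for preference uncertainty. The example values are hypothetical.

```python
import statistics

def epistemic_uncertainty(value_estimates):
    """Disagreement across an ensemble of value predictions,
    measured as the population standard deviation (an assumed,
    simple proxy for model uncertainty)."""
    return statistics.pstdev(value_estimates)

def preference_uncertainty(annotations):
    """Variability among human reward annotations for the same
    outcome (an assumed proxy for preference uncertainty)."""
    return statistics.pstdev(annotations)

# Hypothetical ensemble of value predictions for one state-action pair
ensemble = [1.0, 1.2, 0.9, 1.1]
# Hypothetical human reward labels for the same trajectory
labels = [0.8, 1.0, 0.6]

print(epistemic_uncertainty(ensemble))   # low: the ensemble agrees
print(preference_uncertainty(labels))    # higher: annotators disagree more
```

Both quantities are on the reward scale, so they can be combined or thresholded directly, which is what makes a single confidence-adjusted filter over actions feasible.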

Key facts

  • arXiv:2604.26360
  • Reinforcement learning systems typically optimize scalar reward functions assuming precise evaluation.
  • Real-world objectives from human preferences are often uncertain and inconsistent.
  • Dual-source uncertainty-aware reward framework models epistemic and preference uncertainty.
  • Model uncertainty captured via ensemble disagreement over value predictions.
  • Preference uncertainty derived from variability in reward annotations.
  • Confidence-adjusted Reliability Filter adaptively modulates action selection.
  • Empirical results across multiple discrete environments.
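The confidence-adjusted Reliability Filter listed above could be sketched as follows. This is a speculative reconstruction, not the paper's algorithm: the filter drops actions whose ensemble disagreement exceeds a threshold (`max_uncertainty`) and ranks the rest by mean value minus an uncertainty penalty (`kappa`), both hypothetical parameters, trading exploitation against caution.

```python
import statistics

def reliability_filter(action_values, kappa=1.0, max_uncertainty=0.5):
    """Select an action under uncertainty (assumed mechanism).

    action_values: dict mapping action -> list of ensemble value estimates.
    Actions whose ensemble disagreement exceeds max_uncertainty are
    filtered out; the rest are scored by mean value minus kappa times
    the disagreement, and the highest-scoring action is returned.
    """
    scored = {}
    for action, estimates in action_values.items():
        mu = statistics.fmean(estimates)
        sigma = statistics.pstdev(estimates)
        if sigma <= max_uncertainty:
            scored[action] = mu - kappa * sigma
    if not scored:
        # All actions too uncertain: fall back to the least-uncertain one
        return min(action_values,
                   key=lambda a: statistics.pstdev(action_values[a]))
    return max(scored, key=scored.get)

# Hypothetical case: "risky" has the same mean value as "safe" but the
# ensemble disagrees sharply about it, so the filter rejects it.
values = {"risky": [2.0, 0.0, 2.2, -0.2], "safe": [1.0, 1.1, 0.9, 1.0]}
print(reliability_filter(values))  # "safe"
```

Raising `kappa` or lowering `max_uncertainty` makes the policy more conservative, which is the lever such a filter would use to curb reward hacking on states where the reward signal is poorly estimated.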

Entities

Institutions

  • arXiv