ARTFEED — Contemporary Art Intelligence

Mitigating Cognitive Bias in RLHF by Context-Dependent Rationality

other · 2026-05-11

A new arXiv paper (2605.06895) proposes treating the rationality parameter in reinforcement learning from human feedback (RLHF) as context- and annotation-dependent, rather than a fixed constant, in order to mitigate cognitive biases in human judgments. The standard Boltzmann preference model assumes every annotator is uniformly reliable, but real human feedback is shaped by systematic biases that vary with context. The authors adjust the rationality parameter according to the annotation context, aiming to make reward models robust to imperfect human feedback.
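To make the idea concrete, here is a minimal sketch (not the paper's actual method) of a Boltzmann preference model where the rationality parameter beta is no longer a single constant but a function of the annotation context. The `context_beta` mapping and the `"difficult"` flag are hypothetical illustrations.

```python
import math

def boltzmann_pref(r_a, r_b, beta):
    """P(annotator prefers a over b) under a Boltzmann (Bradley-Terry) model.
    Higher beta = more rational annotator; beta -> 0 = random choices."""
    return 1.0 / (1.0 + math.exp(-beta * (r_a - r_b)))

# Standard RLHF: one fixed beta for every annotation.
p_fixed = boltzmann_pref(1.0, 0.0, beta=2.0)

# Context-dependent variant (illustrative): beta varies per annotation,
# e.g. assuming annotators are less reliable on hard-to-compare pairs.
def context_beta(context):
    # Hypothetical mapping from annotation context to rationality.
    return 0.5 if context["difficult"] else 2.0

p_easy = boltzmann_pref(1.0, 0.0, context_beta({"difficult": False}))
p_hard = boltzmann_pref(1.0, 0.0, context_beta({"difficult": True}))
```

Under this sketch, the same reward gap yields a confident preference probability in an easy context but a much noisier one in a difficult context, which is the behavior a fixed beta cannot capture.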

Key facts

  • arXiv paper 2605.06895 proposes context-dependent rationality in RLHF
  • Standard RLHF uses a fixed rationality parameter beta
  • Human feedback is affected by cognitive biases
  • The method treats rationality as context- and annotation-dependent
  • Goal is to make models robust to imperfect human feedback

Entities

Institutions

  • arXiv