Three Conceptual Models of RLHF Annotation: Extension, Evidence, and Authority
A new arXiv paper (2604.25895) distinguishes three normative models for the role of human judgments in Reinforcement Learning from Human Feedback (RLHF). The 'extension' model treats annotators as extending the system designers' own preferences. The 'evidence' model treats annotators as providing independent evidence about moral or social facts. The 'authority' model grants annotators independent authority as representatives of a population. The author argues that the choice of model shapes how RLHF pipelines should solicit, validate, and aggregate annotations. The paper surveys landmark RLHF studies to show how each implicitly adopts one of these models, and it describes failure modes associated with each.
Key facts
- Paper distinguishes three models: extension, evidence, authority
- Extension: annotators extend designers' judgments
- Evidence: annotators provide independent factual evidence
- Authority: annotators have authority as population representatives
- Models have implications for soliciting, validating, and aggregating annotations (see the sketch after this list)
- Survey of landmark RLHF papers illustrates implicit use of models
- Failure modes are described for each model
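
To make the aggregation point concrete, here is a minimal sketch of how each model could imply a different rule for combining binary preference annotations. The paper is conceptual and contains no code; all function names, the toy votes, and the population weights below are hypothetical illustrations, not the author's method.

```python
# Hypothetical illustration only: three aggregation rules, one per model.
# Votes are binary preference annotations (1 = "response A preferred", 0 = "B").

from collections import Counter
from statistics import mean

annotations = [1, 1, 0, 1, 0, 1, 1]        # toy annotator votes (hypothetical)
designer_gold = 1                           # designers' own judgment
annotator_groups = ["g1", "g1", "g2", "g1", "g2", "g2", "g1"]
population_share = {"g1": 0.5, "g2": 0.5}   # assumed population weights


def aggregate_extension(votes, gold):
    """Extension model: annotators proxy for the designers, so votes that
    diverge from the designers' judgment are treated as annotation error
    (here they are simply filtered out)."""
    consistent = [v for v in votes if v == gold]
    return gold if consistent else None


def aggregate_evidence(votes):
    """Evidence model: each vote is noisy evidence about an independent
    fact, so a simple majority vote (a crude maximum-likelihood estimate
    under symmetric noise) is a natural aggregator."""
    return Counter(votes).most_common(1)[0][0]


def aggregate_authority(votes, groups, shares):
    """Authority model: annotators represent a population, so votes are
    reweighted to match population shares rather than collapsed to a
    single 'correct' answer; the output is a preference rate."""
    by_group = {}
    for v, g in zip(votes, groups):
        by_group.setdefault(g, []).append(v)
    return sum(shares[g] * mean(vs) for g, vs in by_group.items())


print(aggregate_extension(annotations, designer_gold))  # 1
print(aggregate_evidence(annotations))                  # 1
print(aggregate_authority(annotations, annotator_groups,
                          population_share))            # ~0.67
```

Note that only the authority rule preserves disagreement in its output (a preference rate rather than a single label), which is one way the choice of model changes what an RLHF pipeline produces.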