Three Conceptual Models of RLHF Annotation: Extension, Evidence, and Authority
A new arXiv paper (2604.25895) distinguishes three normative models for the role of human judgments in Reinforcement Learning from Human Feedback (RLHF). The 'extension' model treats annotators as extending the system designers' own preferences. The 'evidence' model treats annotators as providing independent evidence about moral or social facts. The 'authority' model grants annotators independent authority as representatives of a population. The author argues that the choice of model shapes how RLHF pipelines should solicit, validate, and aggregate annotations. The paper surveys landmark RLHF studies to show how each implicitly adopts one of these models, and it describes failure modes associated with each.
Key facts
- Paper distinguishes three models: extension, evidence, authority
- Extension: annotators extend designers' judgments
- Evidence: annotators provide independent factual evidence
- Authority: annotators have authority as population representatives
- Models have implications for soliciting, validating, and aggregating annotations (see the sketch after this list)
- Survey of landmark RLHF papers illustrates implicit use of models
- Failure modes are described for each model
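
To make the aggregation point concrete, here is a minimal sketch of how each model could imply a different rule for combining binary preference annotations. The paper is conceptual and contains no code; all function names, the toy votes, and the population weights below are hypothetical illustrations, not the author's method.

```python
# Hypothetical illustration only: three aggregation rules, one per model.
# Votes are binary preference annotations (1 = "response A preferred", 0 = "B").

from collections import Counter
from statistics import mean

annotations = [1, 1, 0, 1, 0, 1, 1]        # toy annotator votes (hypothetical)
designer_gold = 1                           # designers' own judgment
annotator_groups = ["g1", "g1", "g2", "g1", "g2", "g2", "g1"]
population_share = {"g1": 0.5, "g2": 0.5}   # assumed population weights


def aggregate_extension(votes, gold):
    """Extension model: annotators proxy for the designers, so votes that
    diverge from the designers' judgment are treated as annotation error
    (here they are simply filtered out)."""
    consistent = [v for v in votes if v == gold]
    return gold if consistent else None


def aggregate_evidence(votes):
    """Evidence model: each vote is noisy evidence about an independent
    fact, so a simple majority vote (a crude maximum-likelihood estimate
    under symmetric noise) is a natural aggregator."""
    return Counter(votes).most_common(1)[0][0]


def aggregate_authority(votes, groups, shares):
    """Authority model: annotators represent a population, so votes are
    reweighted to match population shares rather than collapsed to a
    single 'correct' answer; the output is a preference rate."""
    by_group = {}
    for v, g in zip(votes, groups):
        by_group.setdefault(g, []).append(v)
    return sum(shares[g] * mean(vs) for g, vs in by_group.items())


print(aggregate_extension(annotations, designer_gold))  # 1
print(aggregate_evidence(annotations))                  # 1
print(aggregate_authority(annotations, annotator_groups,
                          population_share))            # ~0.67
```

Note that only the authority rule preserves disagreement in its output (a preference rate rather than a single label), which is one way the choice of model changes what an RLHF pipeline produces.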