ARTFEED — Contemporary Art Intelligence

RMGAP Benchmark Evaluates Reward Model Generalization

other · 2026-05-06

Researchers introduced RMGAP, a benchmark for evaluating how well reward models used in Reinforcement Learning from Human Feedback generalize across diverse user preferences. The benchmark comprises 1,097 instances spanning Chat, Writing, Reasoning, and Safety domains. For each prompt, four distinct responses with different linguistic profiles were generated to represent varied preferences, and tailored prompts were constructed to convey a specific preference to the reward model. This design addresses a limitation of existing benchmarks, which assume a single universal preference: RMGAP instead tests whether reward models can correctly rank responses in line with the stated preference, a gap in current evaluation methods.

Key facts

  • RMGAP benchmark introduced
  • 1,097 instances across Chat, Writing, Reasoning, Safety domains
  • Four distinct responses per prompt with different linguistic profiles
  • Tailored prompts constructed to convey specific preferences
  • Addresses limitation of existing benchmarks assuming universal preference
  • Focuses on reward model generalizability
  • Reinforcement Learning from Human Feedback context
  • Evaluates ability to rank responses aligned with diverse preferences
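The evaluation described above can be sketched as a simple scoring loop: a reward model scores each prompt's four candidate responses, and an instance counts as correct when the response aligned with the stated preference receives the highest score. This is a minimal illustration under assumed names and data layout, not the benchmark's actual API.

```python
# Hypothetical sketch of RMGAP-style evaluation. The function names, the
# data layout, and the toy reward model are assumptions for illustration.

def ranking_accuracy(instances, reward_model):
    """Fraction of instances where the preference-aligned response ranks first.

    instances: list of dicts with keys:
      "prompt"    - preference-conditioned prompt text
      "responses" - list of candidate responses (four in RMGAP)
      "target"    - index of the response aligned with the stated preference
    reward_model: callable (prompt, response) -> float score
    """
    correct = 0
    for inst in instances:
        scores = [reward_model(inst["prompt"], r) for r in inst["responses"]]
        # Correct if the preferred response is ranked highest by the model.
        if max(range(len(scores)), key=scores.__getitem__) == inst["target"]:
            correct += 1
    return correct / len(instances)

# Toy reward model that simply favors longer responses (illustration only).
toy_rm = lambda prompt, resp: len(resp)

demo = [
    {"prompt": "Prefer concise answers. Q: ...",
     "responses": ["short", "a bit longer", "an even longer reply",
                   "the longest response of all"],
     "target": 0},
]
print(ranking_accuracy(demo, toy_rm))  # length-biased toy_rm misranks: 0.0
```

A length-biased reward model fails this instance because the tailored prompt asks for concision, which is exactly the kind of preference-conditioned failure the benchmark is designed to surface.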
