LLM Reward Models Favor Socially Undesirable Responses

ai-technology · 2026-05-07

A new study from arXiv (2605.05003) reveals that reward models used to align large language models (LLMs) with human preferences often prefer socially undesirable responses. The researchers extended benchmarking to four domains: bias, safety, morality, and ethical reasoning. They introduced a framework converting social evaluation datasets into pairwise preference data, testing five publicly available reward models and two instruction-tuned models. Results show systematic biases in reward model outputs, indicating failures in social alignment that standard instruction-following benchmarks miss.

Key facts

arXiv paper 2605.05003 examines reward model alignment in LLMs
Focus on four domains: bias, safety, morality, ethical reasoning
Framework converts social evaluation datasets into pairwise preference data
Tested five publicly available reward models and two instruction-tuned models
Reward models often prefer socially undesirable responses
Systematically biased distributions over selected outputs observed
Existing evaluations focus on broad instruction-following benchmarks
Important failures in social alignment can remain hidden

LLM Reward Models Favor Socially Undesirable Responses

Key facts

Entities

Institutions

Sources