ARTFEED — Contemporary Art Intelligence

LLM Reward Models Favor Socially Undesirable Responses

ai-technology · 2026-05-07

A new study from arXiv (2605.05003) reveals that reward models used to align large language models (LLMs) with human preferences often prefer socially undesirable responses. The researchers extended benchmarking to four domains: bias, safety, morality, and ethical reasoning. They introduced a framework converting social evaluation datasets into pairwise preference data, testing five publicly available reward models and two instruction-tuned models. Results show systematic biases in reward model outputs, indicating failures in social alignment that standard instruction-following benchmarks miss.

Key facts

  • arXiv paper 2605.05003 examines reward model alignment in LLMs
  • Focus on four domains: bias, safety, morality, ethical reasoning
  • Framework converts social evaluation datasets into pairwise preference data
  • Tested five publicly available reward models and two instruction-tuned models
  • Reward models often prefer socially undesirable responses
  • Systematically biased distributions over selected outputs observed
  • Existing evaluations focus on broad instruction-following benchmarks
  • Important failures in social alignment can remain hidden

Entities

Institutions

  • arXiv

Sources