ARTFEED — Contemporary Art Intelligence

StoryRMB: New Benchmark Reveals LLMs Struggle with Human Story Preferences

ai-technology · 2026-05-07

Researchers have launched StoryRMB, the first benchmark for evaluating reward models on story-generation preferences. The benchmark comprises 1,133 carefully curated instances, each pairing a prompt with one human-chosen story and three rejected stories. Testing showed that current reward models struggle to identify the stories humans prefer: the best-performing model reached only 66.3% accuracy. To improve reward-model effectiveness, the team also constructed roughly 100,000 high-quality training instances. The work highlights the gap between LLM-generated and human-authored narratives, particularly in complex narrative structure and subjective individual preference.

Key facts

  • StoryRMB is the first benchmark for assessing reward models on story preferences.
  • The benchmark contains 1,133 human-verified instances.
  • Each instance includes a prompt, one chosen story, and three rejected stories.
  • The best existing reward model achieved only 66.3% accuracy.
  • Researchers constructed roughly 100,000 high-quality training instances.
  • LLM-generated stories diverge from human-authored works in narrative structure.
  • Human story preferences are inherently subjective and under-explored.
  • The work aims to improve modeling of human story preferences.
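The 1-chosen-vs-3-rejected setup above implies a natural scoring rule: a reward model is credited only when it scores the chosen story above all three rejected ones, so random scoring would land around 25%, which puts the reported 66.3% in context. Below is a minimal sketch of that accuracy computation; the data class, function names, and the toy length-based scorer are illustrative assumptions, not the benchmark's actual code.

```python
# Sketch: scoring a reward model on StoryRMB-style instances.
# An instance counts as correct only if the chosen story outscores
# all three rejected stories. The toy reward function is a stand-in.

from dataclasses import dataclass

@dataclass
class Instance:
    prompt: str
    chosen: str
    rejected: list  # exactly three rejected stories

def accuracy(instances, reward_fn):
    """Fraction of instances where the chosen story beats every rejected one."""
    correct = sum(
        1 for inst in instances
        if reward_fn(inst.prompt, inst.chosen)
        > max(reward_fn(inst.prompt, r) for r in inst.rejected)
    )
    return correct / len(instances)

# Toy reward model: prefers longer stories (purely for demonstration).
toy_reward = lambda prompt, story: len(story)

data = [
    Instance("Write a fable.", "A long, winding tale...", ["Short.", "Tiny.", "Ok."]),
    Instance("Write a haiku story.", "Brief.", ["A much longer rejected story.", "x", "y"]),
]

print(accuracy(data, toy_reward))  # chosen wins in 1 of 2 instances -> 0.5
```

With a real reward model, `reward_fn` would call the model to score each prompt-story pair; the aggregation logic stays the same.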

Entities

Institutions

  • arXiv

Sources