StoryRMB: New Benchmark Reveals LLMs Struggle with Human Story Preferences
Researchers have released StoryRMB, the first benchmark for evaluating reward models on story-generation preferences. The benchmark comprises 1,133 human-verified instances, each pairing a prompt with one chosen story and three rejected stories. Testing showed that existing reward models struggle to identify the stories humans prefer, with the best-performing model reaching only 66.3% accuracy. To improve reward models, the team also constructed roughly 100,000 high-quality training instances. The work highlights the gap between LLM-generated and human-authored narratives, particularly in complex narrative structure and subjective individual preference.
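For illustration, here is a minimal sketch of how accuracy on such instances could be computed: a reward model scores every candidate story for a prompt, and an instance counts as correct when the chosen story outscores all three rejected ones. This scoring rule, the field names, and the `score_story` callable are assumptions for demonstration, not the paper's actual API or metric definition.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    """One StoryRMB-style instance: a prompt, the human-chosen story,
    and three rejected alternatives (field names are illustrative)."""
    prompt: str
    chosen: str
    rejected: list[str]  # exactly three rejected stories per instance

def accuracy(instances, score_story):
    """Fraction of instances where the reward model ranks the chosen
    story above every rejected one. `score_story(prompt, story)` is a
    hypothetical callable returning a scalar reward; the all-vs-three
    correctness rule is an assumption, not the paper's stated metric."""
    correct = 0
    for inst in instances:
        chosen_score = score_story(inst.prompt, inst.chosen)
        if all(chosen_score > score_story(inst.prompt, r) for r in inst.rejected):
            correct += 1
    return correct / len(instances)

# Toy usage with a dummy length-based scorer (demonstration only).
demo = [Instance("Write a fable.", "A long, detailed fable...", ["Hm.", "Ok.", "No."])]
print(accuracy(demo, lambda prompt, story: len(story)))  # -> 1.0
```

Under this rule, random scoring would be correct one time in four, which puts the reported 66.3% well above chance but far from reliable preference modeling.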
Key facts
- StoryRMB is the first benchmark for assessing reward models on story preferences.
- The benchmark contains 1,133 human-verified instances.
- Each instance includes a prompt, one chosen story, and three rejected stories.
- The best existing reward model achieved only 66.3% accuracy.
- Researchers constructed roughly 100,000 high-quality training instances.
- LLM-generated stories diverge from human-authored works in narrative structure.
- Human story preferences are inherently subjective and under-explored.
- The work aims to improve modeling of human story preferences.