ARTFEED — Contemporary Art Intelligence

LLM Leaderboards Critiqued: New Study Proposes User-Defined Evaluation

ai-technology · 2026-04-25

A new study published on arXiv (2604.21769) critically examines LLM leaderboards, revealing that rankings are shaped by benchmark designers' priorities rather than the diverse goals of real users. An analysis of LMArena (formerly Chatbot Arena) data shows that prompts are heavily skewed toward certain topics, that model rankings shift depending on which prompt slice is examined, and that preference-based judgments are stretched beyond the scope they were designed for. The authors propose an interactive visualization interface as a design probe, letting users define their own evaluation criteria by selecting and weighting prompt types.
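
The re-ranking idea behind the proposed interface can be illustrated with a toy computation. The sketch below is not the authors' code: the model names, slice labels, and battle records are all invented. It computes per-slice win rates from pairwise preference votes and then re-ranks models under user-chosen slice weights, showing how two users with different priorities can arrive at different leaderboards.

    from collections import defaultdict

    # Hypothetical pairwise "battles": (prompt_slice, winner, loser).
    battles = [
        ("coding", "model_a", "model_b"),
        ("coding", "model_a", "model_c"),
        ("creative_writing", "model_b", "model_a"),
        ("creative_writing", "model_b", "model_c"),
        ("math", "model_c", "model_a"),
        ("math", "model_a", "model_b"),
    ]

    def slice_win_rates(battles):
        """Win rate of each model within each prompt slice."""
        wins = defaultdict(lambda: defaultdict(int))
        games = defaultdict(lambda: defaultdict(int))
        for slc, winner, loser in battles:
            wins[slc][winner] += 1
            games[slc][winner] += 1
            games[slc][loser] += 1
        return {
            slc: {m: wins[slc][m] / games[slc][m] for m in games[slc]}
            for slc in games
        }

    def user_ranking(rates, weights):
        """Aggregate per-slice win rates with user-defined slice weights."""
        total = sum(weights.values())
        scores = defaultdict(float)
        for slc, weight in weights.items():
            for model, rate in rates.get(slc, {}).items():
                scores[model] += (weight / total) * rate
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    rates = slice_win_rates(battles)
    # A user who cares mostly about coding, a little about math:
    print(user_ranking(rates, {"coding": 0.7, "math": 0.3}))
    # A different user's weights produce a different ranking:
    print(user_ranking(rates, {"creative_writing": 1.0}))

With these toy battles, the coding-focused user sees model_a on top, while a user who weights only creative writing sees model_b first: the same underlying preference data yields different leaderboards under different priorities.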

Key facts

  • arXiv paper 2604.21769 critiques LLM leaderboards.
  • Leaderboard rankings reflect benchmark designers' priorities, not user goals.
  • LMArena dataset is heavily skewed toward certain topics.
  • Model rankings vary across different prompt slices.
  • Preference-based judgments are stretched beyond their intended scope.
  • Authors introduce an interactive visualization interface as a design probe.
  • Interface allows users to define their own evaluation priorities.
  • The study is based on an analysis of the LMArena benchmark data.

Entities

Institutions

  • arXiv
  • LMArena
  • Chatbot Arena

Sources