ARTFEED — Contemporary Art Intelligence

AI research audit reveals systematic capability misrepresentation in LLM evaluations

ai-technology · 2026-05-07

An audit of 112,303 LLM-keyword-matched candidate records published between January 2022 and April 2026 has uncovered widespread misrepresentation of capabilities in evaluations of large language models (LLMs). Of these records, 18,574 papers were deemed admissible, and 4,766 full texts were retrievable. The research, using the Epoch AI Capabilities Index (ECI) reproduced under Arena Elo and Artificial Analysis, measured a 'publication elicitation gap' between the models papers actually test and the contemporaneous frontier. The findings, which illustrate how sparse configuration details and abstracted claims about 'AI' propagate through citations, media, and policy, come from a pre-registered study design and are published on arXiv as 2605.04135v1.
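
The gap metric described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's method: the function names and the ECI-style scores below are invented placeholders, standing in for the idea of comparing the capability score of the model a paper tests against the strongest model available when the paper was published.

```python
# Illustrative sketch of a 'publication elicitation gap' computation.
# The scores below are made-up placeholder values on an ECI-like scale,
# not figures from the audit or from Epoch AI.
ECI_SCORES = {
    "tested-model": 120.0,    # model the paper actually evaluates
    "frontier-model": 155.0,  # strongest model available at publication time
}

def publication_elicitation_gap(tested: str, frontier: str) -> float:
    """Capability-score difference between a paper's tested model
    and the contemporaneous frontier model."""
    return ECI_SCORES[frontier] - ECI_SCORES[tested]

gap = publication_elicitation_gap("tested-model", "frontier-model")
print(gap)  # 35.0
```

A large positive gap would mean the paper benchmarked a model well behind the frontier of its day, as in the audit's example of a 2026 paper evaluating GPT-4o-mini against GPT-5.5 Pro and Claude Opus 4.7.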

Key facts

  • Audit covered 112,303 LLM-keyword-matched candidate records from 2022-01 to 2026-04
  • 18,574 papers were admissible, 4,766 full-paper texts were retrievable
  • Measured a 'publication elicitation gap' between tested models and the contemporaneous frontier
  • Used Epoch AI Capabilities Index (ECI) reproduced under Arena Elo and Artificial Analysis
  • Example: 2026 paper evaluating GPT-4o-mini zero-shot against GPT-5.5 Pro and Claude Opus 4.7
  • Sparse configuration details and abstracted claims about 'AI' propagate through citations, media, and policy
  • Pre-registered study design
  • Published on arXiv as 2605.04135v1

Entities

Institutions

  • arXiv
  • Epoch AI

Sources