ARTFEED — Contemporary Art Intelligence

AI research audit reveals systematic capability misrepresentation in LLM evaluations

ai-technology · 2026-05-07

An audit of 112,303 LLM-keyword-matched candidate records published between January 2022 and April 2026 has uncovered widespread misrepresentation of capabilities in evaluations of large language models (LLMs). Of these records, 18,574 papers were deemed admissible, and 4,766 full texts were retrievable. The research, using the Epoch AI Capabilities Index (ECI) reproduced under Arena Elo and Artificial Analysis, measured a 'publication elicitation gap' between the models papers actually test and the contemporaneous frontier. The findings, which illustrate how sparse configuration details and abstracted claims about 'AI' propagate through citations, media, and policy, come from a pre-registered study design and are published on arXiv as 2605.04135v1.
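
The gap metric described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's method: the function names and the ECI-style scores below are invented placeholders, standing in for the idea of comparing the capability score of the model a paper tests against the strongest model available when the paper was published.

```python
# Illustrative sketch of a 'publication elicitation gap' computation.
# The scores below are made-up placeholder values on an ECI-like scale,
# not figures from the audit or from Epoch AI.
ECI_SCORES = {
    "tested-model": 120.0,    # model the paper actually evaluates
    "frontier-model": 155.0,  # strongest model available at publication time
}

def publication_elicitation_gap(tested: str, frontier: str) -> float:
    """Capability-score difference between a paper's tested model
    and the contemporaneous frontier model."""
    return ECI_SCORES[frontier] - ECI_SCORES[tested]

gap = publication_elicitation_gap("tested-model", "frontier-model")
print(gap)  # 35.0
```

A large positive gap would mean the paper benchmarked a model well behind the frontier of its day, as in the audit's example of a 2026 paper evaluating GPT-4o-mini against GPT-5.5 Pro and Claude Opus 4.7.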

Key facts

  • Audit covered 112,303 LLM-keyword-matched candidate records from 2022-01 to 2026-04
  • 18,574 papers were admissible, 4,766 full-paper texts were retrievable
  • Measured a 'publication elicitation gap' between tested models and the contemporaneous frontier
  • Used Epoch AI Capabilities Index (ECI) reproduced under Arena Elo and Artificial Analysis
  • Example: 2026 paper evaluating GPT-4o-mini zero-shot against GPT-5.5 Pro and Claude Opus 4.7
  • Sparse configuration details and abstracted claims about 'AI' propagate through citations, media, and policy
  • Pre-registered study design
  • Published on arXiv as 2605.04135v1

Entities

Institutions

  • arXiv
  • Epoch AI

Sources