Benchmarking-Cultures-25 dataset reveals fragmented AI evaluation landscape
A recent study has introduced Benchmarking-Cultures-25, an open-source collection of 231 benchmarks drawn from 139 model releases in 2025 by 11 prominent AI developers. According to the research published on arXiv, 63.2% of these benchmarks are used by only one builder, and 38.5% appear in just a single release, suggesting limited comparability across models. Only a handful of benchmarks, including GPQA Diamond and LiveCodeBench, see widespread adoption. The findings point to a shift in how AI model capabilities are communicated, with a growing reliance on press releases and blog posts instead of traditional peer-reviewed publications.
Key facts
- Benchmarking-Cultures-25 dataset includes 231 benchmarks from 139 model releases in 2025.
- 11 major AI builders are covered in the dataset.
- 63.2% of highlighted benchmarks are used by a single builder.
- 38.5% of benchmarks appear in just one release.
- GPQA Diamond and LiveCodeBench are among the few widely used benchmarks.
- The study notes a shift from peer-reviewed literature to press releases and blog posts for establishing AI model competencies.
- The dataset is open-source and includes an interactive exploration tool.
- The research was published on arXiv.
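The single-builder and single-release shares reported above can be computed from a table of benchmark mentions. The sketch below assumes a hypothetical schema of (benchmark, builder, release) rows; the actual Benchmarking-Cultures-25 format may differ, and the sample data is illustrative, not from the dataset.

```python
from collections import defaultdict

# Hypothetical (benchmark, builder, release) rows -- illustrative only,
# not the real Benchmarking-Cultures-25 data or schema.
rows = [
    ("GPQA Diamond", "BuilderA", "modelA-1"),
    ("GPQA Diamond", "BuilderB", "modelB-1"),
    ("LiveCodeBench", "BuilderA", "modelA-1"),
    ("NicheEval", "BuilderC", "modelC-1"),
]

# For each benchmark, collect the distinct builders and releases citing it.
builders_per_bench = defaultdict(set)
releases_per_bench = defaultdict(set)
for bench, builder, release in rows:
    builders_per_bench[bench].add(builder)
    releases_per_bench[bench].add(release)

n = len(builders_per_bench)
single_builder = sum(1 for s in builders_per_bench.values() if len(s) == 1) / n
single_release = sum(1 for s in releases_per_bench.values() if len(s) == 1) / n

print(f"single-builder share: {single_builder:.1%}")
print(f"single-release share: {single_release:.1%}")
```

On the real dataset, the same aggregation would yield the reported 63.2% and 38.5% figures.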
Entities
Institutions
- arXiv