Benchmarking-Cultures-25 dataset reveals fragmented AI evaluation landscape
A recent study has introduced Benchmarking-Cultures-25, an open-source collection of 231 benchmarks drawn from 139 model releases in 2025 by 11 prominent AI developers. According to the research published on arXiv, 63.2% of these benchmarks are used by only one builder, and 38.5% appear in just a single release, suggesting limited comparability across models. Only a handful of benchmarks, including GPQA Diamond and LiveCodeBench, see widespread adoption. The findings point to a shift in how AI model capabilities are communicated, with a growing reliance on press releases and blog posts instead of traditional peer-reviewed publications.
Key facts
- Benchmarking-Cultures-25 dataset includes 231 benchmarks from 139 model releases in 2025.
- 11 major AI builders are covered in the dataset.
- 63.2% of highlighted benchmarks are used by a single builder.
- 38.5% of benchmarks appear in just one release.
- GPQA Diamond and LiveCodeBench are among the few widely used benchmarks.
- The study notes a shift from peer-reviewed literature to press releases and blog posts for establishing AI model competencies.
- The dataset is open-source and includes an interactive exploration tool.
- The research was published on arXiv.
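The single-builder and single-release shares reported above can be computed from a table of benchmark mentions. The sketch below assumes a hypothetical schema of (benchmark, builder, release) rows; the actual Benchmarking-Cultures-25 format may differ, and the sample data is illustrative, not from the dataset.

```python
from collections import defaultdict

# Hypothetical (benchmark, builder, release) rows -- illustrative only,
# not the real Benchmarking-Cultures-25 data or schema.
rows = [
    ("GPQA Diamond", "BuilderA", "modelA-1"),
    ("GPQA Diamond", "BuilderB", "modelB-1"),
    ("LiveCodeBench", "BuilderA", "modelA-1"),
    ("NicheEval", "BuilderC", "modelC-1"),
]

# For each benchmark, collect the distinct builders and releases citing it.
builders_per_bench = defaultdict(set)
releases_per_bench = defaultdict(set)
for bench, builder, release in rows:
    builders_per_bench[bench].add(builder)
    releases_per_bench[bench].add(release)

n = len(builders_per_bench)
single_builder = sum(1 for s in builders_per_bench.values() if len(s) == 1) / n
single_release = sum(1 for s in releases_per_bench.values() if len(s) == 1) / n

print(f"single-builder share: {single_builder:.1%}")
print(f"single-release share: {single_release:.1%}")
```

On the real dataset, the same aggregation would yield the reported 63.2% and 38.5% figures.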
Entities
Institutions
- arXiv