AI Evaluation Costs Surge, Creating New Compute Bottleneck
Evaluating AI systems has become a significant financial burden, with agent evaluations now costing tens of thousands of dollars. The Holistic Agent Leaderboard (HAL) spent $40,000 on 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. Exgentic's $22,000 sweep found a 33× cost spread on identical tasks. The UK-AISI has scaled agentic steps into the millions to study inference-time compute. On The Well, evaluating one new architecture takes about 960 H100-hours, and a full evaluation takes 3,840 H100-hours.
Static benchmarks are expensive too: HELM cost nearly $100,000 to cover 30 models and 42 scenarios. Compression techniques cut static-benchmark costs by 100-200× but agent-benchmark costs by only 2-3.5×. Reliability testing multiplies the bill: repeating the HAL evaluation 8 times would cost $320,000. PaperBench costs $9,500 per run, so comparing six models with three seeds each exceeds $150,000.
This growing disparity in compute resources shapes who can run evaluations, leaving academic institutions, AI Safety Institutes, and journalists at a disadvantage. Costly leaderboards add inefficiency, and no shared infrastructure exists for reusing evaluation data.
Key facts
- HAL spent $40,000 for 21,730 agent rollouts across 9 models and 9 benchmarks.
- A single GAIA run can cost $2,829 before caching.
- Exgentic's $22,000 sweep found a 33× cost spread on identical tasks.
- Evaluating one new architecture on The Well takes 960 H100-hours.
- Compression shrinks static benchmarks by 100-200× but agent benchmarks by only 2-3.5×.
- Reliability testing with 8 reruns would push HAL cost to $320,000.
- PaperBench costs $9,500 per run; three-seed comparison of six models exceeds $150,000.
- UK-AISI scaled agentic steps into the millions to study inference-time compute.
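The multiplied figures in the list above follow from simple arithmetic on the quoted per-run costs. The Python sketch below reproduces them as a sanity check. The function and variable names are hypothetical, and it assumes that every rerun or seed costs the same as the original run, which real sweeps with caching or variable token usage may not match.

```python
# Minimal sketch: reproduce the cost multiples quoted in the key facts.
# Assumption (not from the source): each rerun/seed costs exactly as much
# as the first run, with no caching discount.

def repeated_cost(cost_per_run: int, models: int = 1, repeats: int = 1) -> int:
    """Total cost of running `models` systems `repeats` times each."""
    return cost_per_run * models * repeats

# HAL: $40,000 sweep repeated 8 times for reliability testing.
hal_with_reruns = repeated_cost(40_000, repeats=8)

# PaperBench: $9,500 per run, six models, three seeds each.
paperbench_comparison = repeated_cost(9_500, models=6, repeats=3)

# HAL: average cost per agent rollout across the 21,730 rollouts.
hal_per_rollout = 40_000 / 21_730

print(f"HAL with 8 reruns:        ${hal_with_reruns:,}")
print(f"PaperBench 6 models x 3:  ${paperbench_comparison:,}")
print(f"HAL average per rollout:  ${hal_per_rollout:.2f}")
```

Under these assumptions the PaperBench comparison comes to $171,000, consistent with the "exceeds $150,000" figure, and the HAL sweep averages roughly $1.84 per rollout.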
Entities
Institutions
- Holistic Agent Leaderboard (HAL)
- Princeton University
- Exgentic
- UK-AISI
- Stanford CRFM
- IBM Research
- EleutherAI
- OpenAI
- METR
- ICLR
- ACL
- ICML
- Science (journal)
- arXiv
Locations
- United Kingdom
- United States