ARTFEED — Contemporary Art Intelligence

BenchGuard: Automated Auditing of LLM Agent Benchmarks

ai-technology · 2026-04-30

BenchGuard is a novel framework that uses frontier LLMs to systematically audit benchmarks for execution-based, task-oriented agents. It cross-verifies all benchmark artifacts through structured LLM protocols and can optionally incorporate agent solutions or execution traces as additional diagnostic evidence. Applied to ScienceAgentBench and BIXBench, BenchGuard surfaced 12 author-confirmed issues in ScienceAgentBench, including critical errors, and matched 83.3% of the issues experts identified on BIXBench Verified-50. The broader finding is that many failures attributed to agents are in fact benchmark failures, caused by flawed specifications, implicit assumptions, or inflexible evaluation scripts.
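None of the paper's code appears in this digest, but the audit loop it describes can be sketched. The following is a minimal illustration under assumptions, not BenchGuard's implementation: the BenchmarkTask fields, the AUDIT_PROMPT wording, and the generic call_llm hook are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkTask:
    """Artifacts for a single benchmark task (hypothetical field names)."""
    task_id: str
    specification: str              # natural-language task description
    evaluation_script: str          # source of the script that scores outputs
    agent_trace: str | None = None  # optional execution trace as extra evidence

AUDIT_PROMPT = """You are auditing one task of an execution-based agent benchmark.
Cross-check the artifacts below for flawed specifications, implicit assumptions,
and evaluation scripts that would reject valid solutions.

Specification:
{spec}

Evaluation script:
{script}
{trace_section}
Report each issue as: SEVERITY | ARTIFACT | DESCRIPTION. If none, reply NO_ISSUES."""

def audit_task(task: BenchmarkTask, call_llm: Callable[[str], str]) -> list[str]:
    """One structured audit pass; call_llm is any prompt-in, text-out function."""
    trace_section = (
        f"\nAgent execution trace (diagnostic evidence):\n{task.agent_trace}\n"
        if task.agent_trace
        else ""
    )
    prompt = AUDIT_PROMPT.format(
        spec=task.specification,
        script=task.evaluation_script,
        trace_section=trace_section,
    )
    reply = call_llm(prompt)
    if reply.strip() == "NO_ISSUES":
        return []
    return [line.strip() for line in reply.splitlines() if "|" in line]
```

A real deployment would presumably run several such structured passes per task, cover every artifact type, and aggregate findings before authors confirm them.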

Key facts

  • BenchGuard is the first automated auditing framework for task-oriented, execution-based agent benchmarks.
  • It uses frontier LLMs as systematic auditors of evaluation infrastructure.
  • Cross-verifies all benchmark artifacts via structured LLM protocols.
  • Can incorporate agent solutions or execution traces as additional diagnostic evidence.
  • Deployed on ScienceAgentBench and BIXBench.
  • Identified 12 author-confirmed issues in ScienceAgentBench, including critical errors.
  • Matched 83.3% of expert-identified issues on BIXBench Verified-50 (a toy match-rate computation is sketched after this list).
  • Many agent failures are actually benchmark failures.
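For context on the 83.3% figure: it is the share of expert-identified problems that the automated audit also surfaced. Below is a toy computation of such a match rate with made-up issue keys; the paper's actual issue counts and matching procedure are not given in this digest.

```python
def expert_match_rate(audit_issues: set[str], expert_issues: set[str]) -> float:
    """Fraction of expert-identified issues that the automated audit also found."""
    if not expert_issues:
        return 0.0
    return len(audit_issues & expert_issues) / len(expert_issues)

# Hypothetical example: the audit recovers 5 of 6 expert issues -> 83.3%.
audit = {"t1/spec-gap", "t2/strict-eval", "t3/bad-gold", "t4/ambiguous",
         "t5/data-leak", "t9/extra-find"}
expert = {"t1/spec-gap", "t2/strict-eval", "t3/bad-gold", "t4/ambiguous",
          "t5/data-leak", "t6/missed"}
print(f"{expert_match_rate(audit, expert):.1%}")  # 83.3%
```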

Entities

Institutions

  • arXiv
