Systematic Analysis of AI Agent Safety Benchmarks Reveals Inconsistencies
A new study posted to arXiv presents the first systematic analysis of safety benchmarks for LLM-based autonomous agents, identifying significant inconsistencies in threat models, metrics, and risk coverage. The research catalogs 40 behavioral agent-safety benchmarks from 2023 to 2026, plus 5 adjacent evaluator, defense, and dataset artifacts, and proposes a six-axis taxonomy of benchmark evaluation methodology. A coverage matrix shows broad risk coverage but limited methodological convergence, with the behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only environments. The study highlights the need for standardized evaluation frameworks as agent deployment accelerates.
Key facts
- First systematic analysis dedicated to agent safety benchmarks as evaluation instruments.
- Cataloged 40 behavioral agent-safety benchmarks from 2023 to 2026.
- Also includes 5 adjacent evaluator, defense, and dataset artifacts.
- Proposes a six-axis taxonomy of benchmark evaluation methodology.
- Coverage matrix reveals broad risk coverage but limited methodological convergence.
- Behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only environments.
- Benchmarks developed independently with inconsistent threat models and incompatible metrics.
- Study addresses safety risks extending beyond traditional LLM concerns.
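
To make the coverage-matrix idea concrete, here is a minimal sketch of how such a matrix could be built from tagged benchmark records. It is an illustration only, not the study's method or data: the benchmark names (`bench_a`, `bench_b`, `bench_c`), the risk labels, and the `environment` field are all hypothetical placeholders.

```python
from collections import Counter

# Hypothetical records: names, risk labels, and environments are placeholders,
# not entries from the study's catalog.
BENCHMARKS = {
    "bench_a": {"risks": {"prompt_injection", "data_exfiltration"}, "environment": "sandboxed"},
    "bench_b": {"risks": {"prompt_injection"}, "environment": "sandboxed"},
    "bench_c": {"risks": {"unsafe_tool_use"}, "environment": "live"},
}

RISKS = ["prompt_injection", "data_exfiltration", "unsafe_tool_use"]


def coverage_matrix(benchmarks, risks):
    """Return {benchmark: {risk: bool}} marking which risks each benchmark covers."""
    return {
        name: {risk: risk in record["risks"] for risk in risks}
        for name, record in benchmarks.items()
    }


def risk_counts(matrix):
    """Count how many benchmarks cover each risk (column sums of the matrix)."""
    counts = Counter()
    for row in matrix.values():
        for risk, covered in row.items():
            counts[risk] += int(covered)
    return dict(counts)


def environment_counts(benchmarks):
    """Count benchmarks per environment type, to show where they concentrate."""
    return dict(Counter(record["environment"] for record in benchmarks.values()))


if __name__ == "__main__":
    matrix = coverage_matrix(BENCHMARKS, RISKS)
    print(risk_counts(matrix))
    print(environment_counts(BENCHMARKS))
```

In a layout like this, column sums indicate how broadly each risk is covered, while a tally over a methodology field such as `environment` shows whether benchmarks cluster in one setting, which is the kind of pattern the key facts describe.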
Entities
Institutions
- arXiv