AgentAtlas: New Taxonomy for Evaluating LLM Agents Beyond Accuracy
A new framework called AgentAtlas proposes evaluating large language model (LLM) agents on more than a single accuracy score. Current benchmarks are fragmented, each focusing on a different metric such as task success or tool-call validity. AgentAtlas introduces a six-state control-decision taxonomy (Act, Ask, Refuse, Stop, Confirm, Recover) and a nine-category trajectory-failure taxonomy with two orthogonal labels, primary_error_source and impact. It also measures how much of a model's performance stems from prompt supervision versus inherent capability. The work addresses the need to evaluate deployable agents beyond single accuracy scores.
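To make the contrast with a single accuracy score concrete, here is a minimal sketch of the six control-decision states as a Python enum, with accuracy broken down per expected state. Only the six state names come from the summary above. The function name `per_state_accuracy`, the `(expected, predicted)` pair format, and the sample data are hypothetical and are not taken from AgentAtlas itself.

```python
from collections import Counter
from enum import Enum


class ControlDecision(Enum):
    """The six control-decision states named in the summary."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


def per_state_accuracy(pairs):
    """Return accuracy per expected state.

    `pairs` is an iterable of (expected, predicted) ControlDecision values.
    This pair format is an assumption made for illustration.
    """
    totals, correct = Counter(), Counter()
    for expected, predicted in pairs:
        totals[expected] += 1
        if expected == predicted:
            correct[expected] += 1
    return {state: correct[state] / totals[state] for state in totals}


# Hypothetical data: the agent always acts, even when it should refuse.
pairs = [
    (ControlDecision.ACT, ControlDecision.ACT),
    (ControlDecision.ACT, ControlDecision.ACT),
    (ControlDecision.ACT, ControlDecision.ACT),
    (ControlDecision.REFUSE, ControlDecision.ACT),
]

overall = sum(e == p for e, p in pairs) / len(pairs)
breakdown = per_state_accuracy(pairs)
print(overall)    # 0.75
print(breakdown)  # Act scores 1.0, Refuse scores 0.0
```

In this toy data the overall score of 0.75 hides the fact that the agent never chooses Refuse when it should, which is the kind of gap a per-state breakdown exposes.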
Key facts
- AgentAtlas introduces a six-state control-decision taxonomy: Act, Ask, Refuse, Stop, Confirm, Recover.
- A nine-category trajectory-failure taxonomy is included with orthogonal labels: primary_error_source and impact.
- A taxonomy-aware vs. taxonomy-blind methodology measures how much of a model's performance stems from prompt supervision versus inherent capability.
- Current benchmarks are fragmented, each emphasizing a different unit of measurement.
- A line of 2024-2025 work has converged on the diagnosis that a single accuracy score is insufficient.
- AgentAtlas extends this line of work with four components.
- AgentAtlas also includes a benchmark-coverage audit mapping.
- The work focuses on deployable agents acting on codebases, browsers, operating systems, etc.
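One way to read the taxonomy-aware vs. taxonomy-blind comparison is as a difference between two accuracy figures for the same model: one with the taxonomy supplied in the prompt, one without. The sketch below rests on that assumption. The function names, the boolean result lists, and the interpretation of the gap as the share attributable to prompt supervision are all hypothetical and are not confirmed by the source.

```python
def accuracy(results):
    """Fraction of True entries in a list of per-task pass/fail booleans."""
    return sum(results) / len(results)


def supervision_gap(aware_results, blind_results):
    """Compare accuracy with and without the taxonomy in the prompt.

    Assumption: the gap (aware minus blind) approximates the part of the
    score that comes from prompt supervision rather than the model itself.
    """
    aware = accuracy(aware_results)
    blind = accuracy(blind_results)
    return {"aware": aware, "blind": blind, "gap": aware - blind}


# Hypothetical results on the same four tasks under both prompt conditions.
aware_results = [True, True, True, False]
blind_results = [True, False, True, False]

report = supervision_gap(aware_results, blind_results)
print(report)  # {'aware': 0.75, 'blind': 0.5, 'gap': 0.25}
```

Under this reading, a large gap suggests the score depends heavily on the prompt, while a small gap suggests the behavior is closer to inherent capability.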
Entities
Institutions
- arXiv