AgentAtlas: New Taxonomy for Evaluating LLM Agents Beyond Accuracy

other · 2026-05-22

A new framework called AgentAtlas proposes a more fine-grained way to evaluate large language model (LLM) agents. Current benchmarks are fragmented: each emphasizes a different unit of measurement, such as task success or tool-call validity. AgentAtlas introduces a six-state control-decision taxonomy (Act, Ask, Refuse, Stop, Confirm, Recover) and a nine-category trajectory-failure taxonomy whose failures carry two labels, primary_error_source and impact. It also measures how much of a model's performance stems from prompt supervision versus inherent capability, using a taxonomy-aware vs. taxonomy-blind comparison. The work targets evaluation of deployable agents that goes beyond a single accuracy score.
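To make the taxonomy concrete, here is a minimal Python sketch of how the six control-decision states and the two failure labels could be represented and scored per state. Only the state names and the label names primary_error_source and impact come from the summary. The StepRecord schema, the per_state_accuracy helper, and any label values are hypothetical illustrations, not the framework's actual data format or scoring rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ControlDecision(Enum):
    """The six control-decision states named in the summary."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


@dataclass(frozen=True)
class StepRecord:
    """One evaluated agent step (hypothetical schema, not from the paper)."""
    expected: ControlDecision
    predicted: ControlDecision
    # The two failure labels named in the summary. Values are free-form
    # placeholders because the nine categories are not listed in this text.
    primary_error_source: Optional[str] = None
    impact: Optional[str] = None

    @property
    def correct(self) -> bool:
        return self.expected is self.predicted


def per_state_accuracy(records):
    """Return {state: fraction correct}, keyed by the expected state."""
    totals, hits = {}, {}
    for r in records:
        totals[r.expected] = totals.get(r.expected, 0) + 1
        hits[r.expected] = hits.get(r.expected, 0) + int(r.correct)
    return {state: hits[state] / totals[state] for state in totals}
```

Reporting accuracy per expected state, rather than one pooled number, is one way to see where an agent acts when it should have asked, refused, or stopped.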

Key facts

AgentAtlas introduces a six-state control-decision taxonomy: Act, Ask, Refuse, Stop, Confirm, Recover.
The nine-category trajectory-failure taxonomy uses two orthogonal labels: primary_error_source and impact.
A taxonomy-aware vs. taxonomy-blind methodology measures how much of a model's performance comes from prompt supervision rather than inherent capability.
Current benchmarks are fragmented, each emphasizing different units of measurement.
A line of 2024-2025 work has converged on the diagnosis that a single accuracy score is insufficient.
AgentAtlas extends this line of work with four components.
AgentAtlas also includes a benchmark-coverage audit mapping.
The work focuses on deployable agents acting on codebases, browsers, operating systems, and similar environments.
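The taxonomy-aware vs. taxonomy-blind comparison listed above can be illustrated with a small sketch. It assumes that "aware" means the model was prompted with the taxonomy and "blind" means it was not, and that the quantity of interest is a simple accuracy difference. Both are assumptions made for illustration: the summary does not specify the actual protocol or metric, and the function names here are hypothetical.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("need at least one example")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def supervision_gap(aware_preds, blind_preds, expected):
    """Compare accuracy with and without the taxonomy in the prompt.

    'gap' is aware minus blind: the share of performance that, under the
    assumptions above, would be attributed to prompt supervision.
    """
    aware = accuracy(aware_preds, expected)
    blind = accuracy(blind_preds, expected)
    return {"aware": aware, "blind": blind, "gap": aware - blind}


# Illustrative labels only, not results from the paper.
expected = ["Act", "Ask", "Refuse", "Stop"]
aware_preds = ["Act", "Ask", "Refuse", "Act"]
blind_preds = ["Act", "Act", "Refuse", "Act"]
result = supervision_gap(aware_preds, blind_preds, expected)
```

A large gap would suggest the model leans on the prompt to make the right control decision, while a small gap would suggest the behavior is mostly inherent.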
