ARTFEED — Contemporary Art Intelligence

AgentAtlas: New Taxonomy for Evaluating LLM Agents Beyond Accuracy

other · 2026-05-22

A new framework called AgentAtlas proposes a more fine-grained way to evaluate large language model (LLM) agents. Current benchmarks are fragmented, each focusing on a different metric such as task success or tool-call validity. AgentAtlas introduces a six-state control-decision taxonomy (Act, Ask, Refuse, Stop, Confirm, Recover) and a nine-category trajectory-failure taxonomy with two orthogonal labels, primary_error_source and impact. Its taxonomy-aware vs. taxonomy-blind methodology also measures how much of a model's performance stems from prompt supervision rather than inherent capability. The work addresses the need to evaluate deployable agents on more than a single accuracy score.
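
As a rough illustration of why a control-decision taxonomy says more than one accuracy number, the six states can be modelled as an enumeration and scored per state. This is a minimal sketch: only the six state names come from the summary, while the function name per_decision_accuracy, the scoring rule, and the sample labels are assumptions made for illustration and are not the paper's metric or data.

```python
from collections import Counter
from enum import Enum


class ControlDecision(Enum):
    """The six control decisions named in the AgentAtlas summary."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


def per_decision_accuracy(pairs):
    """Return {decision: fraction correct} for each expected decision seen.

    `pairs` is an iterable of (expected, predicted) ControlDecision values.
    The scoring rule is an illustrative assumption, not the paper's metric.
    """
    totals = Counter()
    hits = Counter()
    for expected, predicted in pairs:
        totals[expected] += 1
        if expected == predicted:
            hits[expected] += 1
    return {decision: hits[decision] / totals[decision] for decision in totals}


# Hypothetical labels, invented purely for illustration.
sample = [
    (ControlDecision.ACT, ControlDecision.ACT),
    (ControlDecision.ASK, ControlDecision.ACT),
    (ControlDecision.ASK, ControlDecision.ASK),
    (ControlDecision.REFUSE, ControlDecision.REFUSE),
]
scores = per_decision_accuracy(sample)
```

In this toy sample the agent is always right when it should Act but only half right when it should Ask, a difference that a single pooled accuracy figure would hide.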

Key facts

  • AgentAtlas introduces a six-state control-decision taxonomy: Act, Ask, Refuse, Stop, Confirm, Recover.
  • The nine-category trajectory-failure taxonomy carries two orthogonal labels: primary_error_source and impact.
  • A taxonomy-aware vs. taxonomy-blind methodology measures how much of a model's performance comes from prompt supervision.
  • Current benchmarks are fragmented, each emphasizing different units of measurement.
  • A line of 2024-2025 work has converged on the diagnosis that a single accuracy score is insufficient.
  • AgentAtlas extends this line of work with four components.
  • One of those components is a benchmark-coverage audit mapping.
  • The work focuses on deployable agents acting on codebases, browsers, operating systems, etc.
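
The two orthogonal failure labels and the taxonomy-aware vs. taxonomy-blind comparison can be sketched in a few lines. This is a hedged sketch only: the field names primary_error_source and impact come from the summary, but the FailureLabel class, the supervision_gap function, the placeholder strings, and the simple subtraction are hypothetical, since the summary gives neither the nine category names nor the paper's actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureLabel:
    """One annotated trajectory failure (hypothetical record layout)."""
    category: str              # one of the nine categories; names are not given in the summary
    primary_error_source: str  # where the failure originated
    impact: str                # what the failure caused, labelled independently of its source


def supervision_gap(aware_score: float, blind_score: float) -> float:
    """Score with the taxonomy in the prompt minus score without it.

    Assumption: a plain difference stands in for whatever measure the paper uses.
    A large gap suggests performance leans on prompt supervision.
    """
    return aware_score - blind_score


# Placeholder values only; they are not real AgentAtlas labels or results.
label = FailureLabel(
    category="example-category",
    primary_error_source="example-source",
    impact="example-impact",
)
gap = supervision_gap(aware_score=0.8, blind_score=0.6)
```

Keeping the source and impact as separate fields lets the same origin of error be paired with any consequence, which is what "orthogonal" implies here.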

Entities

Institutions

  • arXiv

Sources