ARTFEED — Contemporary Art Intelligence

New Framework Diagnoses AI Agent Failures with Span-Level Precision

ai-technology · 2026-05-16

Researchers have built a comprehensive evaluation framework for AI agents that pairs top-down agent-level diagnosis with bottom-up span-level assessment. By decomposing analysis into independent per-span evaluations, the framework scales to traces of arbitrary length and attaches a rationale to each verdict. On the TRAIL benchmark it achieves state-of-the-art results on GAIA and SWE-Bench, with relative gains over prior baselines of up to 38% in category F1, 3.5x in localization accuracy, and 12.5x in joint localization-categorization accuracy. The approach addresses a shortcoming of existing evaluation methods, which rarely explain why an agent succeeded or failed and struggle to localize failure types within long, structured traces.

Key facts

  • The framework pairs top-down agent-level diagnosis with bottom-up span-level evaluation.
  • It decomposes analysis into independent per-span assessments.
  • The framework scales to traces of arbitrary length.
  • It produces span-level rationales for each verdict.
  • On the TRAIL benchmark, it achieves state-of-the-art results on GAIA and SWE-Bench.
  • Relative gains over prior baselines: up to 38% on category F1, 3.5x on localization accuracy, and 12.5x on joint localization-categorization accuracy.
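The per-span decomposition described above can be sketched in a few lines. The sketch below is purely illustrative and not the researchers' implementation: the data structures, the judge_span function, and the toy rule that flags a span on an error marker are all hypothetical stand-ins for an LLM-based judge, chosen only to show why independent per-span verdicts with rationales scale linearly with trace length.

```python
from dataclasses import dataclass

@dataclass
class SpanVerdict:
    span_id: int
    category: str   # failure category, or "ok"
    rationale: str  # span-level justification for the verdict

def judge_span(span_id: int, text: str) -> SpanVerdict:
    # Hypothetical stand-in for an LLM judge call: flag spans that
    # contain an error marker, otherwise pass them.
    if "Traceback" in text:
        return SpanVerdict(span_id, "runtime-error",
                           "span contains an unhandled exception")
    return SpanVerdict(span_id, "ok", "no failure signal in span")

def evaluate_trace(spans: list[str]) -> list[SpanVerdict]:
    # Each span is judged independently of the others, so cost grows
    # linearly with trace length and every verdict carries its own
    # rationale -- the two properties highlighted in the article.
    return [judge_span(i, s) for i, s in enumerate(spans)]

trace = [
    "tool call: search('TRAIL benchmark')",
    "Traceback (most recent call last): ...",
    "final answer emitted",
]
for v in evaluate_trace(trace):
    print(v.span_id, v.category, "-", v.rationale)
```

Because no span's verdict depends on another's, the same loop could be batched or parallelized over arbitrarily long traces, which is the scaling property the framework claims.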
