New AI System Diagnoses LLM Agent Failures at Scale

ai-technology · 2026-05-22

Researchers have developed the Insights Generator (IG), a multi-agent system that identifies systematic behavioral patterns in large language model (LLM) agents by analyzing entire corpora of execution traces. Described in arXiv preprint 2605.21347, the work addresses corpus-level trace diagnostics, where manual review falls short: it covers only small subsets of traces, relies on ad-hoc hypotheses, misses population-level trends, and does not scale to production settings in which a single trace can span tens of thousands of tokens. IG answers diagnostic questions by proposing hypotheses and testing them across groups of traces, then produces evidence-backed insights reports. The system was evaluated with both qualitative and objective measures, including rubric-based assessment of its reports and the downstream performance improvements obtained by applying its suggestions. The work points toward automated, scalable debugging for LLM agents in complex applications.

Key facts

arXiv preprint 2605.21347 introduces the Insights Generator (IG)
IG is a multi-agent system for corpus-level trace diagnostics
It analyzes entire corpora of execution traces to identify systematic behavioral patterns
Manual inspection of traces is limited to small subsets and ad-hoc hypotheses
Individual traces in production can span tens of thousands of tokens
IG answers diagnostic questions by proposing and testing hypotheses
Evaluation includes rubric-based report assessment and downstream performance improvements
The system produces evidence-backed insights reports
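
To make the idea of "proposing and testing hypotheses across trace groups" concrete, here is a minimal Python sketch. It is not IG's implementation, which this summary does not describe. The Trace and Hypothesis types, the "failed"/"succeeded" group labels, and the evaluate_hypothesis helper are all hypothetical names chosen for illustration, and each trace is reduced to a short list of step names rather than tens of thousands of tokens.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trace:
    trace_id: str
    group: str        # hypothetical label, e.g. "failed" or "succeeded"
    steps: List[str]  # simplified stand-in for a long execution trace


@dataclass
class Hypothesis:
    description: str
    predicate: Callable[[Trace], bool]  # True if the trace shows the pattern


def evaluate_hypothesis(hypothesis: Hypothesis, traces: List[Trace]) -> Dict[str, dict]:
    """Measure how often the hypothesized pattern appears in each trace group."""
    totals: Dict[str, int] = defaultdict(int)
    evidence: Dict[str, List[str]] = defaultdict(list)
    for trace in traces:
        totals[trace.group] += 1
        if hypothesis.predicate(trace):
            # Keep the matching trace IDs so the finding is backed by evidence.
            evidence[trace.group].append(trace.trace_id)
    return {
        group: {
            "matches": len(evidence[group]),
            "total": total,
            "rate": len(evidence[group]) / total,
            "evidence": evidence[group],
        }
        for group, total in totals.items()
    }


# Toy corpus (invented data, for illustration only).
traces = [
    Trace("t1", "failed", ["search", "search", "search", "answer"]),
    Trace("t2", "failed", ["search", "search", "answer"]),
    Trace("t3", "succeeded", ["search", "answer"]),
    Trace("t4", "succeeded", ["search", "read", "answer"]),
]

# Hypothetical hypothesis: the agent repeats the same step back to back.
repeats_step = Hypothesis(
    description="Agent repeats the same step consecutively",
    predicate=lambda t: any(a == b for a, b in zip(t.steps, t.steps[1:])),
)

report = evaluate_hypothesis(repeats_step, traces)
for group, stats in sorted(report.items()):
    print(group, stats["matches"], "of", stats["total"], stats["evidence"])
```

In this toy corpus the pattern appears in both failed traces and in neither succeeded trace, which is the kind of group-level contrast, with supporting trace IDs attached, that a corpus-level diagnostic report could cite as evidence.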
