ARTFEED — Contemporary Art Intelligence

Statistical Framework Quantifies AI Agent Reliability Under Perturbations

ai-technology · 2026-05-12

A new paper (arXiv:2605.10516) establishes a rigorous measurement science for AI agent reliability, introducing statistical methods to quantify consistency under semantics-preserving perturbations. The framework uses U-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, distinguishing core capability from execution robustness. Experiments on three agentic benchmarks show that trajectory-level metrics offer greater diagnostic sensitivity than traditional pass@1 rates, revealing that minor task variations can cause complete strategy breakdowns even when agents possess the requisite knowledge. The work provides mathematical tools to isolate where and why agents deviate.
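This summary does not spell out the paper's exact estimator, but the idea of a U-statistic for output-level reliability can be sketched as an unbiased pairwise-agreement estimate over repeated runs of the same task under perturbed prompts (the function name and toy data below are illustrative, not from the paper):

```python
from itertools import combinations

def output_reliability(outputs):
    """U-statistic (order 2) estimate of output-level reliability:
    the fraction of run pairs that agree, an unbiased estimate of
    the probability that two independent perturbed runs match.
    Illustrative sketch; the paper's estimator may differ in detail."""
    pairs = list(combinations(outputs, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Five runs of one task under semantics-preserving rephrasings (toy data)
runs = ["42", "42", "42", "41", "42"]
print(output_reliability(runs))  # 0.6 — 6 of the 10 pairs agree
```

Averaging over all pairs, rather than comparing each run to a single reference, is what makes this a U-statistic and keeps the estimate unbiased for the pairwise-agreement probability.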

Key facts

  • Paper establishes rigorous measurement science for AI agent reliability
  • Uses U-statistics for output-level reliability
  • Uses kernel-based metrics for trajectory-level stability
  • Distinguishes between core capability and execution robustness
  • Validated on three agentic benchmarks
  • Trajectory-level metrics more sensitive than pass@1 rates
  • Minor task variations can cause strategy breakdowns
  • Provides mathematical tools to isolate deviations
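The trajectory-level idea above can likewise be sketched: score each pair of action trajectories with a kernel and average, so that strategy breakdowns show up even when final outputs happen to match. The RBF-on-edit-distance kernel below is a hypothetical choice for illustration; the paper's actual kernel is not specified in this summary:

```python
import math
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two action sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # delete
                          d[i][j - 1] + 1,          # insert
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitute
    return d[m][n]

def trajectory_stability(trajectories, gamma=0.5):
    """Mean pairwise kernel similarity over trajectories, using an
    RBF-style kernel on edit distance (illustrative, not the paper's)."""
    pairs = list(combinations(trajectories, 2))
    return sum(math.exp(-gamma * edit_distance(a, b)) for a, b in pairs) / len(pairs)

trajs = [
    ["search", "open", "extract", "answer"],
    ["search", "open", "extract", "answer"],
    ["search", "answer"],  # strategy breakdown under a minor rephrase
]
print(round(trajectory_stability(trajs), 3))
```

A stability score near 1 means the agent follows essentially the same strategy across perturbations; the divergent third trajectory pulls the score down even though its final answer could still be correct, which is the diagnostic sensitivity pass@1 misses.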

Entities

Institutions

  • arXiv
