ARTFEED — Contemporary Art Intelligence

Statistical Framework Quantifies AI Agent Reliability Under Perturbations

ai-technology · 2026-05-12

A new paper (arXiv:2605.10516) establishes a rigorous measurement science for AI agent reliability, introducing statistical methods to quantify consistency under semantics-preserving perturbations. The framework uses U-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, distinguishing core capability from execution robustness. Experiments on three agentic benchmarks show that trajectory-level metrics offer greater diagnostic sensitivity than traditional pass@1 rates, revealing that minor task variations can cause complete strategy breakdowns even when agents possess the requisite knowledge. The work provides mathematical tools to isolate where and why agents deviate.
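This summary does not spell out the paper's exact estimator, but the idea of a U-statistic for output-level reliability can be sketched as an unbiased pairwise-agreement estimate over repeated runs of the same task under perturbed prompts (the function name and toy data below are illustrative, not from the paper):

```python
from itertools import combinations

def output_reliability(outputs):
    """U-statistic (order 2) estimate of output-level reliability:
    the fraction of run pairs that agree, an unbiased estimate of
    the probability that two independent perturbed runs match.
    Illustrative sketch; the paper's estimator may differ in detail."""
    pairs = list(combinations(outputs, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Five runs of one task under semantics-preserving rephrasings (toy data)
runs = ["42", "42", "42", "41", "42"]
print(output_reliability(runs))  # 0.6 — 6 of the 10 pairs agree
```

Averaging over all pairs, rather than comparing each run to a single reference, is what makes this a U-statistic and keeps the estimate unbiased for the pairwise-agreement probability.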

Key facts

  • Paper establishes rigorous measurement science for AI agent reliability
  • Uses U-statistics for output-level reliability
  • Uses kernel-based metrics for trajectory-level stability
  • Distinguishes between core capability and execution robustness
  • Validated on three agentic benchmarks
  • Trajectory-level metrics more sensitive than pass@1 rates
  • Minor task variations can cause strategy breakdowns
  • Provides mathematical tools to isolate deviations
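The trajectory-level idea above can likewise be sketched: score each pair of action trajectories with a kernel and average, so that strategy breakdowns show up even when final outputs happen to match. The RBF-on-edit-distance kernel below is a hypothetical choice for illustration; the paper's actual kernel is not specified in this summary:

```python
import math
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two action sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # delete
                          d[i][j - 1] + 1,          # insert
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitute
    return d[m][n]

def trajectory_stability(trajectories, gamma=0.5):
    """Mean pairwise kernel similarity over trajectories, using an
    RBF-style kernel on edit distance (illustrative, not the paper's)."""
    pairs = list(combinations(trajectories, 2))
    return sum(math.exp(-gamma * edit_distance(a, b)) for a, b in pairs) / len(pairs)

trajs = [
    ["search", "open", "extract", "answer"],
    ["search", "open", "extract", "answer"],
    ["search", "answer"],  # strategy breakdown under a minor rephrase
]
print(round(trajectory_stability(trajs), 3))
```

A stability score near 1 means the agent follows essentially the same strategy across perturbations; the divergent third trajectory pulls the score down even though its final answer could still be correct, which is the diagnostic sensitivity pass@1 misses.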

Entities

Institutions

  • arXiv
