StressWeb Benchmark Tests LLM Web Agent Robustness Against Realistic Interaction Variability

ai-technology · 2026-04-22

StressWeb is a new diagnostic benchmark for evaluating the robustness of web agents built on large language models. Existing evaluations are mostly run under stable, well-behaved interaction conditions, which can overestimate agent performance. StressWeb first constructs realistic, controllable web environments with clean, stable workflows that serve as reference baselines. It then applies structured, controlled perturbations that emulate real-world interaction variability: shifting layouts, altered interaction semantics, and execution disruptions. Systematically comparing agent behavior in the clean baseline with behavior in the perturbed environments lets the benchmark diagnose robustness under specific "what-if" scenarios. The paper (arXiv:2604.16385v1) argues that high task success in idealized settings may not reflect performance in realistic web interactions, and the benchmark is intended to give a more accurate picture of how agents cope with the variability of actual web use.

Key facts

A diagnostic stress-testing benchmark named StressWeb has been introduced for web agents.
Large language model-based web agents have shown strong performance on realistic web interaction tasks.
Existing evaluations are predominantly conducted under relatively stable and well-behaved interaction conditions.
High task success in idealized settings may overestimate agent robustness and fail to reflect realistic performance.
StressWeb constructs realistic and controllable web environments as reference baselines.
The framework introduces structured and controlled perturbations that emulate interaction variability.
Perturbations include shifting layouts, altered interaction semantics, and execution disruptions.
By comparing agent behavior between clean and perturbed settings, the framework enables systematic diagnosis of robustness.
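
The clean-versus-perturbed comparison described above can be illustrated with a minimal Python sketch. It is not the StressWeb implementation: the function names, the condition labels, and the toy outcomes are all assumptions made for illustration, and the source does not say which metrics the benchmark reports. The sketch simply computes a task success rate per condition and the drop relative to the clean baseline.

```python
from collections import defaultdict


def success_rate(outcomes):
    """Fraction of successful episodes; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def robustness_report(runs, baseline_label="clean"):
    """Compare each perturbed condition against the clean baseline.

    `runs` is an iterable of (condition, success) pairs. The labels are
    hypothetical; StressWeb's real condition names are not given in the source.
    """
    by_condition = defaultdict(list)
    for condition, success in runs:
        by_condition[condition].append(bool(success))

    if baseline_label not in by_condition:
        raise ValueError(f"no runs recorded for baseline {baseline_label!r}")

    baseline = success_rate(by_condition[baseline_label])
    report = {}
    for condition, outcomes in by_condition.items():
        if condition == baseline_label:
            continue
        rate = success_rate(outcomes)
        # "drop" is the loss in success rate relative to the clean baseline.
        report[condition] = {"success_rate": rate, "drop": baseline - rate}
    return baseline, report


# Toy outcomes, invented for illustration only (not results from the paper).
toy_runs = [
    ("clean", True), ("clean", True), ("clean", True), ("clean", True),
    ("layout_shift", True), ("layout_shift", False),
    ("layout_shift", True), ("layout_shift", False),
    ("execution_disruption", True), ("execution_disruption", False),
    ("execution_disruption", False), ("execution_disruption", False),
]

baseline, report = robustness_report(toy_runs)
print(f"clean baseline success rate: {baseline:.2f}")
for condition, stats in sorted(report.items()):
    print(f"{condition}: success={stats['success_rate']:.2f} drop={stats['drop']:.2f}")
```

Reporting the drop per perturbation type, rather than a single pooled score, keeps the diagnosis tied to a specific "what-if" scenario, which is the kind of comparison the summary attributes to the benchmark.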

Sources

arXiv cs.AI — 2026-04-21