Study Examines Unreliability Factors in Computer-Use Agents for Real-World Tasks

ai-technology · 2026-04-22

A recent study examines why computer-use agents behave inconsistently. These agents have made significant strides on real-world tasks such as web navigation and desktop automation, yet an agent that succeeds on a task once may fail when the same task is executed again. The researchers focus on three sources of unreliability: stochasticity during execution, ambiguity in the task specification, and variability in agent behavior. Using the OSWorld benchmark, they executed the same tasks repeatedly and applied paired statistical tests to measure how task-level outcomes change across conditions. The results indicate that reliability depends on both how clearly a task is specified and how much the agent's behavior varies between executions. The authors argue that evaluation methods must account for these sources of unreliability if agent consistency is to improve. The study is available on arXiv as 2604.17849v1.

Key facts

Computer-use agents have improved on real-world tasks such as web navigation and desktop automation.
Agents may succeed at a task once but fail on repeated executions of the same task.
The study examines three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior.
Analysis was conducted using OSWorld with repeated task executions and paired statistical tests.
Reliability depends on both task specification and agent behavior variability across executions.
The findings suggest a need for evaluation methods that address these unreliability sources.
The research is documented in arXiv:2604.17849v1.
In some cases, agents surpass human performance on specific tasks.
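The repeated-execution analysis described above can be illustrated with a small sketch. This is not the study's code. The task ids, the toy outcomes, and the helper names (`any_success_rate`, `all_success_rate`, `mcnemar_exact_p`) are hypothetical. The exact McNemar test is also an assumed choice of paired test, because this summary does not name the test the authors used. The sketch first contrasts the fraction of tasks solved in at least one run with the fraction solved in every run. It then compares per-task outcomes under two conditions, for example an original instruction and a hypothetically clarified rewrite.

```python
from math import comb

# Hypothetical toy data: task id -> success flag for each of 3 repeated runs.
runs = {
    "t1": [True, True, True],
    "t2": [True, False, True],
    "t3": [False, False, False],
    "t4": [False, True, False],
}


def any_success_rate(runs):
    """Fraction of tasks solved in at least one run (optimistic view)."""
    return sum(any(r) for r in runs.values()) / len(runs)


def all_success_rate(runs):
    """Fraction of tasks solved in every run (consistency view)."""
    return sum(all(r) for r in runs.values()) / len(runs)


def mcnemar_exact_p(a, b):
    """Two-sided exact McNemar p-value for paired binary outcomes.

    a, b map the same task ids to success (True) or failure (False)
    under condition A and condition B. Only discordant tasks carry
    information. Under the null hypothesis each one is equally likely
    to favor either condition, so the count is Binomial(n, 0.5).
    """
    only_a = sum(1 for t in a if a[t] and not b[t])
    only_b = sum(1 for t in a if b[t] and not a[t])
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant tasks: no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired outcomes for six tasks under two conditions,
# e.g. an original instruction (A) and a clarified rewrite (B).
cond_a = {"t1": True, "t2": False, "t3": False, "t4": False, "t5": False, "t6": False}
cond_b = {"t1": True, "t2": True, "t3": True, "t4": True, "t5": True, "t6": False}

print(any_success_rate(runs))           # 0.75
print(all_success_rate(runs))           # 0.25
print(mcnemar_exact_p(cond_a, cond_b))  # 0.125
```

With these made-up outcomes, three of the four tasks are solved at least once, but only one is solved in all three runs. This is the kind of gap between one-off success and repeatable success that the study describes. The paired test looks only at discordant tasks. Four discordant tasks all favoring one condition give a two-sided exact p-value of 0.125, so a consistent-looking shift on so few tasks is still not significant at the 0.05 level.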
