Flawed Benchmarks Overstate AI Agent Performance
A recent study indicates that a 1MB replay script, which executes recorded actions without screen observation, surpasses leading models on key static benchmarks for Computer Use Agents (CUAs). The researchers demonstrate that in deterministic settings, the expected success rate of this script matches the source agent's pass@k metric, highlighting a significant flaw in existing evaluation techniques. They identify two primary causes for these shortcomings: poorly designed environments (static, unsandboxed, or inadequately verified) and flawed evaluation methods (naive aggregation and inappropriate use of pass@k in stateful UI interactions). To tackle the first issue, the authors introduce PRISM, a framework of five design principles for CUA environments: privileged verification, realistic settings, integrity-checked configurations, sandboxed execution, and multifactorial variability. This study is available on arXiv with the identifier 2605.08261.
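To make the idea of a replay script that never observes the screen concrete, here is a minimal sketch; it is not the paper's actual script. It assumes a hypothetical JSON trace of recorded actions and uses the pyautogui library to re-issue them blindly, pacing itself with fixed delays instead of visual feedback.

```python
# Minimal sketch of a "blind" replay agent (illustrative only).
# The trace format and file name are hypothetical, not taken from the paper.
import json
import time

import pyautogui  # used only to emit the recorded actions; nothing is read back


def replay(trace_path: str) -> None:
    with open(trace_path) as f:
        actions = json.load(f)  # e.g. [{"type": "click", "x": 412, "y": 88}, ...]

    for act in actions:
        if act["type"] == "click":
            pyautogui.click(act["x"], act["y"])
        elif act["type"] == "type":
            pyautogui.write(act["text"], interval=0.02)
        elif act["type"] == "hotkey":
            pyautogui.hotkey(*act["keys"])
        # No screenshot, no element lookup: just wait a fixed interval and continue.
        time.sleep(act.get("delay", 0.3))


if __name__ == "__main__":
    replay("recorded_success.json")  # hypothetical trace recorded from a source agent
```

In a fully static, deterministic benchmark task, re-issuing a previously successful action sequence like this is enough to satisfy the task checker, which is precisely the evaluation flaw the study highlights.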
Key facts
- A 1MB replay script that never observes the screen outperforms frontier models on static CUA benchmarks.
- The script's expected success rate equals the source agent's pass@k in deterministic environments (a short derivation follows this list).
- Two root causes identified: non-principled environment design and non-principled evaluation methodology.
- PRISM is proposed as five design principles for CUA environments.
- The study is available on arXiv with ID 2605.08261.
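The pass@k equivalence can be made concrete with a short derivation. The notation below (S_i, R, p) is ours rather than the paper's, and it assumes the agent's k recorded attempts are independent and the benchmark environment is fully deterministic and static.

```latex
% Sketch of the equivalence argument (our notation, not necessarily the paper's).
% S_i \in \{0,1\}: outcome of the source agent's i-th recorded attempt at a task;
% R: outcome of replaying a recorded successful attempt whenever one exists.
% Determinism means the replay reproduces the recorded outcome exactly, so
\[
  \Pr[R = 1] \;=\; \Pr\bigl[\exists\, i \le k : S_i = 1\bigr]
             \;=\; 1 - (1 - p)^k \;=\; \mathrm{pass@}k ,
\]
% where p is the agent's per-attempt success probability on that task.
```

In other words, a script that simply replays whichever recorded attempt succeeded scores exactly what the source agent scores under pass@k, without ever observing the screen.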
Entities
Institutions
- arXiv