AgentHijack Benchmark Tests MLLM Agent Robustness
A new benchmark called AgentHijack evaluates the robustness of computer-use agents powered by multimodal large language models (MLLMs) against common environmental corruptions. The benchmark introduces nine configurable corruptions, such as pop-ups, resolution changes, and competing applications, that disrupt agent execution without adversarial intent. Testing on desktop tasks reveals that even minor corruptions cause significant performance degradation, highlighting the fragility of current agents and the need for robustness evaluation. The paper is available on arXiv as 2605.25707.
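To make "performance degradation" concrete, the sketch below shows one way a drop in task success under a single corruption could be quantified. It is a minimal illustration, not the benchmark's actual code or metric: the `Corruption` class, its `severity` scale, and the `success_rate` and `relative_drop` helpers are all hypothetical names, and the outcome lists are toy values rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corruption:
    """Hypothetical description of one environmental corruption."""
    name: str
    severity: float  # assumed scale: 0.0 (off) to 1.0 (strongest)


def success_rate(outcomes):
    """Fraction of tasks completed successfully (True = success)."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def relative_drop(clean, corrupted):
    """Relative decrease in success rate from clean to corrupted runs."""
    base = success_rate(clean)
    if base == 0:
        raise ValueError("clean success rate is zero; relative drop undefined")
    return (base - success_rate(corrupted)) / base


# Toy outcomes for illustration only; these are not results from the paper.
popup = Corruption(name="pop-up", severity=0.2)
clean_runs = [True, True, True, True]
corrupted_runs = [True, False, True, False]
print(popup.name, relative_drop(clean_runs, corrupted_runs))
```

Reporting the drop relative to the clean baseline, rather than as an absolute difference, lets agents with different baseline success rates be compared on the same scale. Whether the paper uses this measure is not stated in the summary.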
Key facts
- AgentHijack is a benchmark for computer-use agent robustness.
- It targets MLLM-based agents.
- Nine common corruptions are introduced.
- Corruptions include pop-ups, resolution changes, and competing apps.
- Minor corruptions cause substantial performance drops.
- The benchmark emphasizes the fragility of current agents.
- The paper is on arXiv: 2605.25707.
- The research underscores the need for robustness evaluation.
Entities
Institutions
- arXiv