WindowsWorld Benchmark Tests GUI Agents on Cross-Application Workflows
Researchers have introduced WindowsWorld, a benchmark for evaluating GUI agents on complex, multi-step tasks that require coordinating several desktop applications. Unlike existing benchmarks, which focus on single-application tasks, WindowsWorld models realistic professional workflows. Its tasks are generated by a multi-agent framework guided by 16 occupations, at four difficulty levels and with intermediate inspection, then refined through human review and executed in a simulated environment. The benchmark contains 181 tasks spanning 17 common desktop applications, with an average of 5.0 sub-goals per task; 78% of the tasks are inherently multi-application. The work was posted on arXiv.
Key facts
- WindowsWorld is a benchmark for GUI agents in cross-application workflows.
- It addresses the gap in existing benchmarks that focus on single-application tasks.
- The benchmark uses a multi-agent framework steered by 16 occupations.
- Tasks are generated at four difficulty levels with intermediate inspection.
- Tasks are refined by human review and executed in a simulated environment.
- WindowsWorld contains 181 tasks with an average of 5.0 sub-goals per task.
- The tasks span 17 common desktop applications.
- 78% of the tasks are inherently multi-application.
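To make the headline statistics concrete, here is a minimal Python sketch of how a multi-application share and an average sub-goal count could be computed over a task list. The `Task` record, its field names, and the toy tasks are hypothetical illustrations, not the benchmark's actual schema or data. The final lines only restate the reported figures as rough counts; because 78% and 5.0 are rounded values, those counts are approximations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; the fields are illustrative assumptions."""
    task_id: str
    difficulty: int              # assumed to range over four levels, 1..4
    apps: tuple[str, ...]        # applications the task touches
    sub_goals: tuple[str, ...]   # intermediate checkpoints for the task


def multi_app_share(tasks: list[Task]) -> float:
    """Fraction of tasks that involve more than one distinct application."""
    if not tasks:
        return 0.0
    return sum(len(set(t.apps)) > 1 for t in tasks) / len(tasks)


def mean_sub_goals(tasks: list[Task]) -> float:
    """Average number of sub-goals per task."""
    if not tasks:
        return 0.0
    return sum(len(t.sub_goals) for t in tasks) / len(tasks)


# Toy, made-up tasks used only to exercise the helpers above.
toy_tasks = [
    Task("t1", 1, ("Excel",), ("open file", "sum column", "save")),
    Task("t2", 2, ("Excel", "Word"), ("copy table", "paste into report")),
    Task("t3", 3, ("Outlook", "Excel", "Word"),
         ("download attachment", "filter rows", "write summary", "send reply")),
]

# Rough counts implied by the reported (rounded) figures.
approx_multi_app_tasks = round(0.78 * 181)   # about 141 of 181 tasks
approx_total_sub_goals = round(5.0 * 181)    # about 905 sub-goals overall
```

Keeping the application list as part of each record means the multi-application share is derived from the data rather than stored as a separate flag, so it cannot drift out of sync with the task definitions.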