WindowsWorld Benchmark Tests GUI Agents on Cross-Application Workflows

other · 2026-05-01

Researchers have introduced WindowsWorld, a benchmark that evaluates GUI agents on complex, multi-step tasks requiring coordination across multiple desktop applications. Unlike existing benchmarks, which focus on tasks within a single application, WindowsWorld is designed to mirror real professional workflows. Its tasks are generated at four difficulty levels by a multi-agent framework guided by 16 occupations, then refined through human review and executed in a simulated environment. The benchmark contains 181 tasks with an average of 5.0 sub-goals each, spanning 17 common desktop applications; 78% of the tasks are inherently multi-application. The work was posted on arXiv.

Key facts

WindowsWorld is a benchmark for GUI agents in cross-application workflows.
It addresses the gap in existing benchmarks that focus on single-application tasks.
The benchmark uses a multi-agent framework steered by 16 occupations.
Tasks are generated at four difficulty levels with intermediate inspection.
Tasks are refined by human review and executed in a simulated environment.
WindowsWorld contains 181 tasks with an average of 5.0 sub-goals.
The tasks span 17 common desktop applications.
78% of the tasks are inherently multi-application.
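
The summary gives the headline figures as percentages and averages rather than raw counts. As a rough sketch, the Python below converts them into approximate counts. It assumes that 78% is rounded to a whole percent and that 5.0 is rounded to one decimal place; neither assumption is confirmed by the source. The helper name counts_matching_percent is hypothetical, and the results are estimates, not figures reported by the authors.

```python
import math

# Figures taken from the summary.
TOTAL_TASKS = 181
AVG_SUBGOALS = 5.0        # assumption: rounded to one decimal place
MULTI_APP_PERCENT = 78    # assumption: rounded to a whole percent


def counts_matching_percent(total: int, percent: int) -> list[int]:
    """Return every integer count whose share of `total` rounds to `percent`.

    Hypothetical helper for illustration; uses round-half-up on the percentage.
    """
    return [
        k for k in range(total + 1)
        if math.floor(100 * k / total + 0.5) == percent
    ]


# Task counts consistent with "78% of 181 tasks".
multi_app_candidates = counts_matching_percent(TOTAL_TASKS, MULTI_APP_PERCENT)

# Approximate total number of sub-goals implied by the reported average.
approx_subgoals = round(TOTAL_TASKS * AVG_SUBGOALS)

print(multi_app_candidates)  # [141, 142]
print(approx_subgoals)       # 905
```

Under these rounding assumptions, either 141 or 142 of the 181 tasks involve more than one application, and the benchmark covers roughly 905 sub-goals in total.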

Sources

arXiv cs.AI — 2026-05-01