Workspace-Bench: Benchmarking AI Agents on Large-Scale File Dependency Tasks

ai-technology · 2026-05-07

Researchers have introduced Workspace-Bench, a benchmark for evaluating AI agents on workspace learning tasks that rely heavily on file dependencies. The benchmark simulates realistic work settings with five worker profiles and 74 file types, totaling 20,476 files (up to 20 GB) and 388 tasks. Each task has its own file dependency graph, and evaluation uses 7,399 rubrics that test cross-file retrieval, contextual reasoning, and adaptive decision-making. A smaller subset, Workspace-Bench-Lite, contains 100 tasks. The work addresses a gap in existing benchmarks, which often use artificially created files that lack real-world applicability. The paper is available on arXiv under the identifier 2605.03596.

Key facts

Workspace-Bench evaluates AI agents on workspace learning with large-scale file dependencies.
The benchmark includes 5 worker profiles, 74 file types, and 20,476 files (up to 20 GB).
There are 388 tasks, each with its own file dependency graph.
Evaluation uses 7,399 rubrics for cross-file retrieval, contextual reasoning, and adaptive decision-making.
Workspace-Bench-Lite is a 100-task subset.
Existing benchmarks often rely on artificially created files that lack real-world file dependencies.
The paper is on arXiv: 2605.03596.
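
To make the idea of a per-task file dependency graph and rubric-based scoring more concrete, here is a minimal Python sketch. It is an illustration only: the Task record, its field names, the example file names, and the scoring rule (fraction of rubrics satisfied) are all assumptions made for this example and are not taken from the Workspace-Bench paper or its data format.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative, not from the paper."""
    name: str
    entry_files: list[str]
    # Maps a file to the files it directly depends on (assumed representation).
    depends_on: dict[str, list[str]] = field(default_factory=dict)
    rubrics: list[str] = field(default_factory=list)


def required_files(task: Task) -> set[str]:
    """Return every file reachable from the entry files (transitive closure)."""
    seen: set[str] = set()
    stack = list(task.entry_files)
    while stack:
        path = stack.pop()
        if path in seen:
            continue  # already visited; also guards against cycles
        seen.add(path)
        stack.extend(task.depends_on.get(path, []))
    return seen


def rubric_score(task: Task, passed: set[str]) -> float:
    """Fraction of the task's rubrics that were satisfied (assumed scoring rule)."""
    if not task.rubrics:
        return 0.0
    return sum(r in passed for r in task.rubrics) / len(task.rubrics)


# Made-up example: a report that depends on a spreadsheet and notes,
# where the spreadsheet in turn depends on a raw CSV export.
task = Task(
    name="quarterly-summary",
    entry_files=["report.docx"],
    depends_on={
        "report.docx": ["sales.xlsx", "notes.md"],
        "sales.xlsx": ["raw_orders.csv"],
    },
    rubrics=["cites sales.xlsx totals", "mentions notes.md caveat"],
)
files = required_files(task)
score = rubric_score(task, passed={"cites sales.xlsx totals"})
```

In this toy case an agent would need to locate four files to complete the task, even though only one is named as the entry point, which is the kind of cross-file retrieval the rubrics are described as testing.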
