TerminalWorld benchmark tests AI agents on real terminal tasks

ai-technology · 2026-05-23

Researchers have introduced TerminalWorld, a scalable data engine that automatically reverse-engineers evaluation tasks from real-world terminal recordings. From 80,870 recordings, the engine produced a benchmark of 1,530 validated tasks across 18 categories, covering 1,280 unique commands, with workflows ranging from simple operations to more than 50 steps. A curated subset of 200 manually reviewed tasks, TerminalWorld-Verified, was used to benchmark eight frontier models and six agents; the highest pass rate achieved was only 62.5%. The benchmark captures capabilities distinct from those measured by existing expert-curated benchmarks such as Terminal-Bench: scores on the two correlate only weakly (Pearson r=0.20). Because the engine is automated, the benchmark can expand continuously as new recordings are added.

Key facts

TerminalWorld is a scalable data engine for reverse-engineering evaluation tasks from terminal recordings.
80,870 terminal recordings were processed.
The benchmark includes 1,530 validated tasks across 18 real-world categories.
Tasks cover 1,280 unique commands.
Workflows range from short operations to those exceeding 50 steps.
TerminalWorld-Verified is a subset of 200 manually reviewed tasks.
Eight frontier models and six agents were benchmarked.
Maximum pass rate on TerminalWorld-Verified was 62.5%.
Scores on the benchmark correlate only weakly with Terminal-Bench scores (Pearson r=0.20).
The engine can continuously expand the benchmark.
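
For readers who want to see how the two headline statistics above are computed, here is a minimal Python sketch. It is not the authors' evaluation code: the function names pass_rate and pearson_r are hypothetical, and the inputs are illustrative placeholders rather than data from the paper. The only figure tied to the source is the arithmetic that 62.5% of the 200 TerminalWorld-Verified tasks corresponds to 125 passed tasks.

```python
from math import sqrt


def pass_rate(results):
    """Fraction of tasks passed, given one boolean per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# 125 passes out of 200 tasks gives the reported maximum of 62.5%.
# The per-task outcomes here are placeholders; only the counts matter.
best_run = [True] * 125 + [False] * 75
print(f"pass rate: {pass_rate(best_run):.1%}")

# Toy inputs (not benchmark data) showing the range of the coefficient:
# perfectly aligned scores give r = 1, perfectly opposed scores give r = -1.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```

In this sketch each system would contribute one score per benchmark, so xs and ys would hold per-system scores on TerminalWorld-Verified and Terminal-Bench respectively. A value near 0.20 means that ranking well on one benchmark says little about ranking well on the other.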

Entities

—

Sources

arXiv cs.AI — 2026-05-23