TerminalWorld benchmark tests AI agents on real terminal tasks
Researchers have introduced TerminalWorld, a scalable data engine that automatically reverse-engineers evaluation tasks from real-world terminal recordings. From 80,870 recordings, the engine produced a benchmark of 1,530 validated tasks across 18 categories, covering 1,280 unique commands, with workflows ranging from simple operations to more than 50 steps. Eight frontier models and six agents were benchmarked on TerminalWorld-Verified, a curated subset of 200 manually reviewed tasks, and the best pass rate was only 62.5%. The benchmark appears to capture capabilities distinct from those measured by expert-curated benchmarks such as Terminal-Bench: scores on the two correlate only weakly (Pearson r=0.20). Because the engine is automated, the benchmark can keep expanding as new recordings are added.
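To make the reported Pearson r=0.20 more concrete, here is a minimal sketch of how such a coefficient is computed from paired scores. It assumes the correlation is taken over per-model scores on the two benchmarks, which the summary does not spell out. The function name `pearson_r` and both score lists are hypothetical, made-up values for illustration only, not results from the study.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical pass rates for the same models on two benchmarks.
# These numbers are invented for illustration and do NOT come from the paper.
benchmark_a = [0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15]
benchmark_b = [0.40, 0.52, 0.33, 0.58, 0.29, 0.47, 0.36, 0.44]

r = pearson_r(benchmark_a, benchmark_b)
print(f"Pearson r = {r:.2f}")
```

A value near 1 would mean that models ranking high on one benchmark also rank high on the other. A value near 0, like the reported 0.20, means that performance on one benchmark says little about performance on the other.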
Key facts
- TerminalWorld is a scalable data engine for reverse-engineering evaluation tasks from terminal recordings.
- 80,870 terminal recordings were processed.
- The benchmark includes 1,530 validated tasks across 18 real-world categories.
- Tasks cover 1,280 unique commands.
- Workflows range from short operations to those exceeding 50 steps.
- TerminalWorld-Verified is a subset of 200 manually reviewed tasks.
- Eight frontier models and six agents were benchmarked.
- Maximum pass rate on TerminalWorld-Verified was 62.5%.
- TerminalWorld scores correlate only weakly (Pearson r=0.20) with Terminal-Bench scores.
- The automated engine can continuously expand the benchmark as new recordings are added.