ARTFEED — Contemporary Art Intelligence

TerminalWorld benchmark tests AI agents on real terminal tasks

ai-technology · 2026-05-23

Researchers have introduced TerminalWorld, a scalable data engine that automatically reverse-engineers evaluation tasks from real-world terminal recordings. From 80,870 recordings, the engine produced a benchmark of 1,530 validated tasks across 18 categories, covering 1,280 unique commands, with workflows ranging from simple operations to more than 50 steps. A curated subset of 200 manually reviewed tasks, TerminalWorld-Verified, was used to benchmark eight frontier models and six agents; the best result was a pass rate of only 62.5%. The benchmark appears to capture capabilities distinct from those measured by existing expert-curated benchmarks such as Terminal-Bench: scores on the two correlate only weakly (Pearson r=0.20). Because the engine is automated, the benchmark can expand continuously as new recordings are added.

Key facts

  • TerminalWorld is a scalable data engine for reverse-engineering evaluation tasks from terminal recordings.
  • 80,870 terminal recordings were processed.
  • The benchmark includes 1,530 validated tasks across 18 real-world categories.
  • Tasks cover 1,280 unique commands.
  • Workflows range from short operations to those exceeding 50 steps.
  • TerminalWorld-Verified is a subset of 200 manually reviewed tasks.
  • Eight frontier models and six agents were benchmarked.
  • Maximum pass rate on TerminalWorld-Verified was 62.5%.
  • Scores on TerminalWorld correlate only weakly with Terminal-Bench scores (Pearson r=0.20).
  • The engine can continuously expand the benchmark.
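
For readers less familiar with the two statistics quoted above, here is a minimal Python sketch of how a pass rate and a Pearson correlation coefficient are typically computed. It is not the researchers' evaluation code. The function names are hypothetical, the figure of 125 passed tasks assumes the 62.5% is a simple fraction of the 200 Verified tasks (the source does not say how it was aggregated), and the two score lists are made-up placeholders, not reported results.

```python
from math import sqrt


def pass_rate(passed: int, total: int) -> float:
    """Return the percentage of tasks passed (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Assumption: 62.5% of 200 tasks, read as a simple fraction, is 125 tasks.
    print(pass_rate(125, 200))

    # Placeholder per-system scores on two benchmarks (NOT real data).
    benchmark_a = [10.0, 20.0, 30.0, 40.0]
    benchmark_b = [12.0, 8.0, 15.0, 11.0]
    print(round(pearson_r(benchmark_a, benchmark_b), 2))
```

A coefficient near 1 would mean systems rank almost identically on both benchmarks; a value such as the reported 0.20 means a system's standing on one says little about its standing on the other.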
