ARTFEED — Contemporary Art Intelligence

JobBench: New Benchmark Evaluates AI Agents on Human-Preferred Tasks

ai-technology · 2026-05-27

Researchers have introduced JobBench, a benchmark that evaluates AI agents on workflows that experts identify as high-priority for delegation, shifting the focus from economic replacement to human empowerment. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, so the agent must reason through cluttered information to complete it. Outputs are graded by a fact-anchored chain of rubrics averaging 35.6 binary criteria per task. The strongest model, Claude Opus 4.7 under Claude Code, scores only 45.9%. The benchmark's stated aim is to shift the labor-market effect the community targets from replacement to enhancement: agents that do what humans actually want delegated, rather than what is most economically valuable.

Key facts

  • JobBench evaluates AI agents on workflows experts identify as high-priority for delegation
  • Covers 130 agentic tasks across 35 occupations
  • Each task includes a workspace of heterogeneous reference files
  • Graded by a fact-anchored chain of rubrics averaging 35.6 binary criteria per task
  • 36 models evaluated; Claude Opus 4.7 under Claude Code reaches 45.9%
  • Aims to shift the targeted labor-market effect from replacement to enhancement
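
To make the rubric-based grading concrete, here is a minimal Python sketch of how binary criteria could roll up into a percentage score. The summary does not say how JobBench aggregates its criteria or how the "chain" structure affects scoring, so the unweighted pass rate per task and the plain average across tasks are assumptions, and the names `Criterion`, `task_score`, and `benchmark_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One binary rubric item (hypothetical structure, not JobBench's schema)."""
    description: str
    passed: bool


def task_score(criteria: list[Criterion]) -> float:
    """Fraction of binary criteria satisfied for one task.

    Assumption: criteria are unweighted and independent.
    """
    if not criteria:
        raise ValueError("a task needs at least one criterion")
    return sum(c.passed for c in criteria) / len(criteria)


def benchmark_score(tasks: list[list[Criterion]]) -> float:
    """Average of per-task scores (assumed aggregation, not confirmed by the source)."""
    if not tasks:
        raise ValueError("no tasks to score")
    return sum(task_score(t) for t in tasks) / len(tasks)


# Illustrative data only; these are not real JobBench criteria.
example_tasks = [
    [Criterion("cites the correct invoice total", True),
     Criterion("flags the duplicated entry", False)],
    [Criterion("uses the latest file revision", True),
     Criterion("summary matches the source table", True)],
]
print(f"{benchmark_score(example_tasks):.1%}")
```

Under these assumptions, a headline figure such as 45.9% would read as the average share of rubric criteria a model's outputs satisfy.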

Entities

Institutions

  • arXiv

Sources