JobBench: New Benchmark Evaluates AI Agents on Human-Preferred Tasks

ai-technology · 2026-05-27

Researchers have introduced JobBench, a benchmark that evaluates AI agents on workflows that experts identify as high-priority for delegation, shifting the focus from economic replacement to human empowerment. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through cluttered information streams. Outputs are graded by a fact-anchored chain of rubrics averaging 35.6 binary criteria per task. Of the 36 models evaluated, the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. The benchmark aims to redirect the community's intended labor-market effect from replacement to enhancement: building agents that do what humans actually want to delegate rather than what is most economically valuable.
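
To make the grading idea concrete, here is a minimal sketch of how a score could be derived from binary rubric criteria. It rests on assumptions the article does not confirm: that a task's score is the fraction of its criteria passed, and that the benchmark score is the unweighted mean over tasks. The names TaskResult, task_score, and benchmark_score are hypothetical and are not taken from JobBench.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Hypothetical container: one task and its pass/fail rubric outcomes."""
    task_id: str
    criteria: list[bool]  # one entry per binary rubric criterion


def task_score(result: TaskResult) -> float:
    """Fraction of binary criteria passed (assumed aggregation, not confirmed)."""
    if not result.criteria:
        raise ValueError(f"task {result.task_id} has no criteria")
    return sum(result.criteria) / len(result.criteria)


def benchmark_score(results: list[TaskResult]) -> float:
    """Unweighted mean of per-task scores (assumed aggregation, not confirmed)."""
    return mean(task_score(r) for r in results)


# Illustrative data only; these are not JobBench results.
example = [
    TaskResult("task-a", [True, False, True, True]),
    TaskResult("task-b", [False, False, True, False]),
]
print(f"{benchmark_score(example):.1%}")
```

Under this assumed scheme each task counts equally, no matter how many criteria it has. A pooled score over all criteria would instead weight tasks with longer rubrics more heavily.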

Key facts

JobBench evaluates AI agents on workflows experts identify as high-priority for delegation
Covers 130 agentic tasks across 35 occupations
Each task includes a workspace of heterogeneous reference files
Graded by a fact-anchored chain of rubrics averaging 35.6 binary criteria per task
36 models evaluated; Claude Opus 4.7 under Claude Code reaches 45.9%
Aims to shift the intended labor-market effect from replacement to enhancement
