JobBench: New Benchmark Evaluates AI Agents on Human-Preferred Tasks
Researchers have introduced JobBench, a benchmark that evaluates AI agents on workflows that experts identify as high-priority for delegation, shifting the focus from economic replacement to human empowerment. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, which requires the agent to reason through cluttered information. Outputs are graded by a fact-anchored chain of rubrics averaging 35.6 binary criteria per task. The strongest model, Claude Opus 4.7 under Claude Code, achieves only 45.9%. The benchmark aims to shift the community's intended labor-market effect from replacement to enhancement: building agents that do what humans actually want delegated rather than what is most economically valuable.
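To make the grading idea concrete, here is a minimal sketch of how a rubric of binary criteria could be turned into a percentage score. It is an illustration only: the summary does not say how JobBench aggregates criteria or tasks, so the `Criterion` class, the pass-fraction rule in `task_score`, and the unweighted mean in `benchmark_score` are all assumptions, and the example data is made up.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One binary rubric item (hypothetical structure, not JobBench's schema)."""
    description: str
    passed: bool


def task_score(criteria):
    """Fraction of binary criteria passed for a single task (assumed rule)."""
    if not criteria:
        raise ValueError("a task needs at least one criterion")
    return sum(c.passed for c in criteria) / len(criteria)


def benchmark_score(tasks):
    """Unweighted mean of per-task scores, as a percentage (assumed rule)."""
    if not tasks:
        raise ValueError("no tasks to score")
    scores = [task_score(criteria) for criteria in tasks.values()]
    return 100.0 * sum(scores) / len(scores)


# Made-up example: two tasks with a handful of criteria each.
example_tasks = {
    "task_a": [
        Criterion("cites the correct figure from the reference file", True),
        Criterion("uses the requested output format", True),
        Criterion("flags the conflicting source", False),
        Criterion("states the final recommendation", True),
    ],
    "task_b": [
        Criterion("identifies the relevant document", True),
        Criterion("computes the total correctly", False),
    ],
}

print(f"{benchmark_score(example_tasks):.1f}%")
```

Under these assumptions, a task counts for the same weight whether it has two criteria or fifty. Pooling all criteria across tasks before dividing would be a different, equally plausible choice.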
Key facts
- JobBench evaluates AI agents on workflows experts identify as high-priority for delegation
- Covers 130 agentic tasks across 35 occupations
- Each task includes a workspace of heterogeneous reference files
- Graded by a fact-anchored chain of rubrics averaging 35.6 binary criteria per task
- 36 models evaluated; Claude Opus 4.7 under Claude Code reaches 45.9%
- Aims to shift the intended labor-market effect from replacement to enhancement
Entities
Institutions
- arXiv