AgencyBench: Benchmarking LLM Agents in 1M-Token Real-World Contexts
AgencyBench is a new benchmark for evaluating large language model (LLM)-based autonomous agents across 32 real-world scenarios, with tasks that average roughly 90 tool calls, 1 million tokens of context, and hours of execution time. It comprises 138 tasks, each with a specific query, deliverables, and a rubric, covering 6 core agentic capabilities. To sidestep the scalability bottleneck of human-in-the-loop feedback, evaluation is automated: a user simulation agent supplies iterative feedback, and a Docker sandbox performs visual and functional rubric-based checks. AgencyBench is derived from daily AI usage and aims to capture the long-horizon, complex tasks that existing benchmarks fail to represent.
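To make the iterative-feedback design concrete, here is a minimal sketch of such an evaluation loop. All names here (`Task`, `Transcript`, `run_episode`, the `agent.act` and `user_sim.review` interfaces, and the round budget) are hypothetical illustrations of the pattern described above, not AgencyBench's actual API.

```python
# Hypothetical sketch of an agent/user-simulator feedback loop.
from dataclasses import dataclass, field


@dataclass
class Task:
    query: str         # the initial user request
    rubric: list[str]  # criteria the deliverable must satisfy


@dataclass
class Transcript:
    turns: list[tuple[str, str]] = field(default_factory=list)  # (role, text)


def run_episode(agent, user_sim, task: Task, max_rounds: int = 5) -> Transcript:
    """Drive the agent with simulated user feedback until the simulator
    is satisfied or the round budget is exhausted."""
    transcript = Transcript()
    message = task.query
    for _ in range(max_rounds):
        transcript.turns.append(("user", message))
        deliverable = agent.act(message)  # agent works, possibly issuing many tool calls
        transcript.turns.append(("agent", deliverable))
        feedback, satisfied = user_sim.review(deliverable, task.rubric)
        if satisfied:        # simulator accepts the deliverable
            break
        message = feedback   # feed the critique back as the next user turn
    return transcript
```

In a benchmark of this scale, the agent side of such a loop also drives tool use (around 90 calls per task on average), which is what pushes episodes toward the million-token range.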
Key facts
- AgencyBench is introduced as a comprehensive benchmark for LLM-based autonomous agents.
- It evaluates 6 core agentic capabilities across 32 real-world scenarios.
- The benchmark includes 138 tasks with specific queries, deliverables, and rubrics.
- Tasks require an average of 90 tool calls, 1 million tokens, and hours of execution time.
- Automated evaluation uses a user simulation agent for iterative feedback.
- A Docker sandbox conducts visual and functional rubric-based evaluation (see the sketch after this list).
- The benchmark addresses the scalability bottleneck of human-in-the-loop feedback.
- AgencyBench is derived from daily AI usage to capture long-horizon real-world scenarios.
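As a rough illustration of sandboxed rubric checking, the sketch below runs functional checks against an agent's deliverable inside a disposable container via the docker-py SDK. The image name, mount paths, and check commands are assumptions for illustration; AgencyBench's actual harness (including its visual-evaluation step) is not shown.

```python
# Hypothetical sketch: functional rubric checks in an isolated Docker container.
import docker


def check_rubric_item(workspace: str, check_cmd: str,
                      image: str = "python:3.11-slim") -> bool:
    """Run one functional check against the agent's deliverable in an
    isolated container; a zero exit status counts as a pass."""
    client = docker.from_env()
    try:
        client.containers.run(
            image,
            check_cmd,
            volumes={workspace: {"bind": "/workspace", "mode": "ro"}},
            working_dir="/workspace",
            network_disabled=True,  # keep the check hermetic
            mem_limit="1g",
            remove=True,            # clean up the container afterwards
        )
        return True
    except docker.errors.ContainerError:
        return False  # non-zero exit: rubric item failed


if __name__ == "__main__":
    # Example: score a deliverable against two hypothetical rubric checks.
    checks = ["python -m pytest -q", "python build_report.py --validate"]
    passed = sum(check_rubric_item("/tmp/agent_output", c) for c in checks)
    print(f"{passed}/{len(checks)} rubric items passed")
```

Running each check in a fresh, network-disabled container keeps evaluation reproducible and isolates the agent's artifacts from the host, which is what makes fully automated grading viable at this scale.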