ARTFEED — Contemporary Art Intelligence

AgencyBench: Benchmarking LLM Agents in 1M-Token Real-World Contexts

ai-technology · 2026-04-25

AgencyBench is a new benchmark for evaluating autonomous agents built on large language models (LLMs) in 32 real-world scenarios whose tasks average 90 tool calls, 1 million tokens of context, and hours of execution time. It comprises 138 tasks, each with a specific query, deliverables, and a grading rubric, together covering 6 core agentic capabilities. Evaluation is fully automated: a user-simulation agent supplies iterative feedback, removing the scalability bottleneck of human-in-the-loop review, while a Docker sandbox performs visual and functional rubric-based checks. The benchmark is derived from daily AI usage and aims to capture the long-horizon, complex tasks that existing benchmarks fail to represent.
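
To make the evaluation protocol concrete, the loop below is a minimal Python sketch of it. Every name here (RubricItem, UserSimulator, evaluate) is an illustrative assumption rather than AgencyBench's actual code, and the Docker sandbox is reduced to a comment: in the benchmark, rubric checks run inside an isolated container so functional tests can execute code and visual tests can render output.

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class RubricItem:
        """One gradable criterion for a task deliverable."""
        description: str
        check: Callable[[str], bool]  # inspects the deliverable, returns pass/fail

    class UserSimulator:
        """Stands in for a human reviewer: accepts the deliverable or asks for revisions."""
        def __init__(self, task_query: str):
            self.task_query = task_query

        def review(self, deliverable: str) -> Optional[str]:
            # In the benchmark an LLM judges the deliverable against the task;
            # here anything non-empty is accepted to keep the sketch runnable.
            return None if deliverable.strip() else "Deliverable is empty; please retry."

    def evaluate(agent_step: Callable[[str], str],
                 task_query: str,
                 rubrics: list[RubricItem],
                 max_rounds: int = 5) -> float:
        """Iterate agent <-> simulated user, then score the final deliverable."""
        simulator = UserSimulator(task_query)
        prompt, deliverable = task_query, ""
        for _ in range(max_rounds):
            deliverable = agent_step(prompt)          # one long-horizon agent run
            feedback = simulator.review(deliverable)  # simulated-user feedback
            if feedback is None:                      # simulator is satisfied
                break
            prompt = feedback
        # Rubric-based scoring; in AgencyBench this step runs inside a Docker
        # sandbox so checks can execute code and render user interfaces.
        passed = sum(r.check(deliverable) for r in rubrics)
        return passed / len(rubrics)

    # Trivial usage: a stand-in agent and a single rubric item.
    score = evaluate(lambda p: "draft report", "Write a report.",
                     [RubricItem("non-empty output", lambda d: bool(d.strip()))])
    print(f"rubric score: {score:.0%}")

The design point is the one the benchmark itself makes: the simulated user, not a human, decides when a deliverable is acceptable, which is what allows 138 long-horizon tasks to be graded at scale.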

Key facts

  • AgencyBench is introduced as a comprehensive benchmark for LLM-based autonomous agents.
  • It evaluates 6 core agentic capabilities across 32 real-world scenarios.
  • The benchmark includes 138 tasks with specific queries, deliverables, and rubrics.
  • Tasks require an average of 90 tool calls, 1 million tokens, and hours of execution time.
  • Automated evaluation uses a user-simulation agent for iterative feedback.
  • A Docker sandbox conducts visual and functional rubric-based evaluation.
  • The benchmark addresses the scalability bottleneck of human-in-the-loop feedback.
  • AgencyBench is derived from daily AI usage to capture long-horizon real-world scenarios.

Entities

Institutions

  • arXiv

Sources