AgencyBench: Benchmarking LLM Agents in 1M-Token Real-World Contexts
AgencyBench is a new benchmark for evaluating large language model (LLM)-based autonomous agents across 32 real-world scenarios, with tasks that average roughly 90 tool calls, 1 million tokens of context, and hours of execution time. It comprises 138 tasks, each with a specific query, deliverables, and a rubric, covering 6 core agentic capabilities. To sidestep the scalability bottleneck of human-in-the-loop feedback, evaluation is automated: a user simulation agent supplies iterative feedback, and a Docker sandbox performs visual and functional rubric-based checks. AgencyBench is derived from daily AI usage and aims to capture the long-horizon, complex tasks that existing benchmarks fail to represent.
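To make the iterative-feedback design concrete, here is a minimal sketch of such an evaluation loop. All names here (`Task`, `Transcript`, `run_episode`, the `agent.act` and `user_sim.review` interfaces, and the round budget) are hypothetical illustrations of the pattern described above, not AgencyBench's actual API.

```python
# Hypothetical sketch of an agent/user-simulator feedback loop.
from dataclasses import dataclass, field


@dataclass
class Task:
    query: str         # the initial user request
    rubric: list[str]  # criteria the deliverable must satisfy


@dataclass
class Transcript:
    turns: list[tuple[str, str]] = field(default_factory=list)  # (role, text)


def run_episode(agent, user_sim, task: Task, max_rounds: int = 5) -> Transcript:
    """Drive the agent with simulated user feedback until the simulator
    is satisfied or the round budget is exhausted."""
    transcript = Transcript()
    message = task.query
    for _ in range(max_rounds):
        transcript.turns.append(("user", message))
        deliverable = agent.act(message)  # agent works, possibly issuing many tool calls
        transcript.turns.append(("agent", deliverable))
        feedback, satisfied = user_sim.review(deliverable, task.rubric)
        if satisfied:        # simulator accepts the deliverable
            break
        message = feedback   # feed the critique back as the next user turn
    return transcript
```

In a benchmark of this scale, the agent side of such a loop also drives tool use (around 90 calls per task on average), which is what pushes episodes toward the million-token range.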
Key facts
- AgencyBench is introduced as a comprehensive benchmark for LLM-based autonomous agents.
- It evaluates 6 core agentic capabilities across 32 real-world scenarios.
- The benchmark includes 138 tasks with specific queries, deliverables, and rubrics.
- Tasks require an average of 90 tool calls, 1 million tokens, and hours of execution time.
- Automated evaluation uses a user simulation agent for iterative feedback.
- A Docker sandbox conducts visual and functional rubric-based evaluation (see the sketch after this list).
- The benchmark addresses the scalability bottleneck of human-in-the-loop feedback.
- AgencyBench is derived from daily AI usage to capture long-horizon real-world scenarios.
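As a rough illustration of sandboxed rubric checking, the sketch below runs functional checks against an agent's deliverable inside a disposable container via the docker-py SDK. The image name, mount paths, and check commands are assumptions for illustration; AgencyBench's actual harness (including its visual-evaluation step) is not shown.

```python
# Hypothetical sketch: functional rubric checks in an isolated Docker container.
import docker


def check_rubric_item(workspace: str, check_cmd: str,
                      image: str = "python:3.11-slim") -> bool:
    """Run one functional check against the agent's deliverable in an
    isolated container; a zero exit status counts as a pass."""
    client = docker.from_env()
    try:
        client.containers.run(
            image,
            check_cmd,
            volumes={workspace: {"bind": "/workspace", "mode": "ro"}},
            working_dir="/workspace",
            network_disabled=True,  # keep the check hermetic
            mem_limit="1g",
            remove=True,            # clean up the container afterwards
        )
        return True
    except docker.errors.ContainerError:
        return False  # non-zero exit: rubric item failed


if __name__ == "__main__":
    # Example: score a deliverable against two hypothetical rubric checks.
    checks = ["python -m pytest -q", "python build_report.py --validate"]
    passed = sum(check_rubric_item("/tmp/agent_output", c) for c in checks)
    print(f"{passed}/{len(checks)} rubric items passed")
```

Running each check in a fresh, network-disabled container keeps evaluation reproducible and isolates the agent's artifacts from the host, which is what makes fully automated grading viable at this scale.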