ARTFEED — Contemporary Art Intelligence

AgentEscapeBench: Benchmarking LLM Agents' Tool-Grounded Reasoning

ai-technology · 2026-05-11

Researchers have launched AgentEscapeBench, an escape-room-style benchmark that assesses whether LLM-based agents can sustain tool-grounded reasoning beyond familiar workflows and short-horizon interactions. The benchmark tests whether agents can deduce, implement, and adapt novel tool-use procedures while honoring explicit long-range dependency constraints. Each task is built on a directed acyclic dependency graph over tools and items: agents must invoke real external functions, track hidden state that is revealed only gradually, propagate intermediate results forward, and produce a verifiable final answer. With 270 instances spanning five difficulty tiers, AgentEscapeBench supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance degrades sharply as dependency depth grows: human success falls from 98.3% at the easiest tier to 80.0% at difficulty-5. The benchmark thus serves as a stringent test of agents' reasoning in intricate, multi-step scenarios.
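To make the task structure concrete, the following is a minimal Python sketch of how an escape-room instance could be represented as a dependency DAG over tools and items, with items staying hidden until produced. All names here (Tool, run_agent, the three example tools) are hypothetical illustrations, not the benchmark's actual harness or API.

# Sketch of an AgentEscapeBench-style task (names are hypothetical, not the
# benchmark's actual API): a DAG over tools and items, where each item stays
# hidden until the tool that produces it is successfully called.
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    requires: list[str]   # items that must already be held (long-range deps)
    produces: str         # item revealed by a successful call

def run_agent(tools: dict[str, Tool], plan: list[str]) -> set[str]:
    """Execute a sequence of tool calls, propagating intermediate results.

    A call only succeeds if every dependency -- an item produced by some
    earlier call -- is already in the inventory.
    """
    inventory: set[str] = {"start_key"}   # assumed initial item
    for name in plan:
        tool = tools[name]
        if all(req in inventory for req in tool.requires):
            inventory.add(tool.produces)  # result propagates to later calls
    return inventory

# Toy depth-3 dependency chain: drawer -> code slip -> safe -> exit key.
tools = {
    "open_drawer": Tool("open_drawer", ["start_key"], "code_slip"),
    "read_code":   Tool("read_code",   ["code_slip"], "safe_code"),
    "open_safe":   Tool("open_safe",   ["safe_code"], "exit_key"),
}
final = run_agent(tools, ["open_drawer", "read_code", "open_safe"])
assert "exit_key" in final   # verifiable final answer: the agent "escapes"

Deepening the dependency chain in a sketch like this is what the harder tiers presumably stress: each extra edge forces the agent to carry an intermediate result across more steps before it can be used.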

Key facts

  • AgentEscapeBench is an escape-room-style benchmark for LLM agents.
  • It evaluates tool-grounded reasoning under long-range dependency constraints.
  • Tasks involve directed acyclic dependency graphs over tools and items.
  • Agents must invoke real external functions and track hidden state.
  • The benchmark includes 270 instances across five difficulty tiers.
  • It supports fully automated evaluation.
  • Sixteen LLM agents and human participants were tested.
  • Human success drops from 98.3% at the easiest tier to 80.0% at difficulty-5.
