ComplexMCP Benchmark Reveals LLM Agents Fail at Real-World Tool Use
A new benchmark called ComplexMCP evaluates LLM agents on interdependent tool use in dynamic environments, revealing a significant performance gap relative to humans. Built on the Model Context Protocol (MCP), the benchmark includes over 300 tools drawn from 7 stateful sandboxes covering office suites and financial systems. A seed-driven architecture makes dynamic environment states and unpredictable API failures reproducible across runs. Evaluations under both full-context and RAG paradigms show that top-tier models achieve a success rate below 60%, while humans reach 90%.
Key facts
- ComplexMCP is a benchmark for LLM agents in dynamic, interdependent tool sandboxes.
- It provides over 300 tools from 7 stateful sandboxes.
- Tools are atomic, interdependent, and prone to environmental noise.
- Benchmark uses seed-driven architecture for dynamic states and API failures.
- Top-tier LLMs fail to exceed 60% success rate.
- Human performance is 90%.
- Evaluation covers full-context and RAG paradigms.
- Built on the Model Context Protocol (MCP).
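The seed-driven design described above can be illustrated with a toy sketch. All names here (`SeededSandbox`, `call_tool`, the failure rate, and the single `create_doc` tool) are hypothetical and not taken from the benchmark; the point is only that seeding one RNG makes both the sandbox's dynamic state and its injected API failures reproducible, so different agents can be compared on identical runs.

```python
import random


class SeededSandbox:
    """Toy sketch of a seed-driven stateful sandbox (hypothetical API).

    The same seed reproduces the same initial state and the same
    sequence of injected tool failures, so evaluation runs are
    deterministic and comparable across agents.
    """

    def __init__(self, seed: int, failure_rate: float = 0.2):
        self.rng = random.Random(seed)  # all dynamism flows from this seed
        self.failure_rate = failure_rate
        # Toy stateful environment, e.g. a document store in an office suite.
        self.state = {"doc_count": self.rng.randint(1, 10)}

    def call_tool(self, name: str) -> int:
        # Unpredictable-but-reproducible API failure injection:
        # the agent must detect the error and retry or re-plan.
        if self.rng.random() < self.failure_rate:
            raise RuntimeError(f"transient failure in tool '{name}'")
        if name == "create_doc":
            self.state["doc_count"] += 1
            return self.state["doc_count"]
        raise ValueError(f"unknown tool: {name}")


def run_episode(seed: int, steps: int = 5) -> list:
    """Drive one sandbox for a few tool calls, recording outcomes."""
    sandbox = SeededSandbox(seed)
    outcomes = []
    for _ in range(steps):
        try:
            outcomes.append(sandbox.call_tool("create_doc"))
        except RuntimeError:
            outcomes.append("fail")
    return outcomes
```

With this structure, two runs under the same seed see identical states and identical failure points (`run_episode(42) == run_episode(42)`), while different seeds produce different dynamic conditions, which is what lets a benchmark hold the environment fixed while varying the agent.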