ComplexMCP Benchmark Reveals LLM Agents Fail at Real-World Tool Use
A new benchmark called ComplexMCP evaluates LLM agents on interdependent tool use in dynamic environments, revealing a significant performance gap relative to humans. Built on the Model Context Protocol (MCP), the benchmark includes over 300 tools drawn from 7 stateful sandboxes covering office suites and financial systems. A seed-driven architecture makes dynamic environment states and unpredictable API failures reproducible across runs. Evaluations under both full-context and RAG paradigms show that top-tier models achieve a success rate below 60%, while humans reach 90%.
Key facts
- ComplexMCP is a benchmark for LLM agents in dynamic, interdependent tool sandboxes.
- It provides over 300 tools from 7 stateful sandboxes.
- Tools are atomic, interdependent, and prone to environmental noise.
- Benchmark uses seed-driven architecture for dynamic states and API failures.
- Top-tier LLMs fail to exceed 60% success rate.
- Human performance is 90%.
- Evaluation covers full-context and RAG paradigms.
- Built on the Model Context Protocol (MCP).
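The seed-driven design described above can be illustrated with a toy sketch. All names here (`SeededSandbox`, `call_tool`, the failure rate, and the single `create_doc` tool) are hypothetical and not taken from the benchmark; the point is only that seeding one RNG makes both the sandbox's dynamic state and its injected API failures reproducible, so different agents can be compared on identical runs.

```python
import random


class SeededSandbox:
    """Toy sketch of a seed-driven stateful sandbox (hypothetical API).

    The same seed reproduces the same initial state and the same
    sequence of injected tool failures, so evaluation runs are
    deterministic and comparable across agents.
    """

    def __init__(self, seed: int, failure_rate: float = 0.2):
        self.rng = random.Random(seed)  # all dynamism flows from this seed
        self.failure_rate = failure_rate
        # Toy stateful environment, e.g. a document store in an office suite.
        self.state = {"doc_count": self.rng.randint(1, 10)}

    def call_tool(self, name: str) -> int:
        # Unpredictable-but-reproducible API failure injection:
        # the agent must detect the error and retry or re-plan.
        if self.rng.random() < self.failure_rate:
            raise RuntimeError(f"transient failure in tool '{name}'")
        if name == "create_doc":
            self.state["doc_count"] += 1
            return self.state["doc_count"]
        raise ValueError(f"unknown tool: {name}")


def run_episode(seed: int, steps: int = 5) -> list:
    """Drive one sandbox for a few tool calls, recording outcomes."""
    sandbox = SeededSandbox(seed)
    outcomes = []
    for _ in range(steps):
        try:
            outcomes.append(sandbox.call_tool("create_doc"))
        except RuntimeError:
            outcomes.append("fail")
    return outcomes
```

With this structure, two runs under the same seed see identical states and identical failure points (`run_episode(42) == run_episode(42)`), while different seeds produce different dynamic conditions, which is what lets a benchmark hold the environment fixed while varying the agent.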