ARTFEED — Contemporary Art Intelligence

ComplexMCP Benchmark Reveals LLM Agents Fail at Real-World Tool Use

ai-technology · 2026-05-12

A new benchmark called ComplexMCP evaluates LLM agents on interdependent tool use in dynamic environments, revealing a significant performance gap between agents and humans. Built on the Model Context Protocol (MCP), the benchmark comprises over 300 tools across 7 stateful sandboxes covering office suites and financial systems, and uses a seed-driven architecture to simulate dynamic states and unpredictable API failures. Across both full-context and RAG evaluation paradigms, top-tier models achieve a success rate below 60%, while humans reach 90%.
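The seed-driven idea can be sketched as follows: all randomness in a sandbox, including its initial state and its "unpredictable" API failures, flows from a single seed, so an episode is surprising to the agent yet exactly reproducible across runs. This is an illustrative toy, not the ComplexMCP implementation; the class, tool names, and failure rate are assumptions.

```python
import random

class SeededSandbox:
    """Toy stateful sandbox whose dynamic state and API failures are all
    derived from one seed (a sketch of a seed-driven architecture)."""

    def __init__(self, seed: int, failure_rate: float = 0.2):
        self.rng = random.Random(seed)        # every random draw comes from the seed
        self.failure_rate = failure_rate
        # dynamic initial state: differs per seed, identical for equal seeds
        self.state = {"balance": self.rng.randint(100, 1000)}

    def call_tool(self, name: str, **kwargs):
        # Failure the agent cannot anticipate, but the same seed always
        # produces the same failure pattern, so evaluation is repeatable.
        if self.rng.random() < self.failure_rate:
            raise RuntimeError(f"{name}: transient API failure")
        if name == "get_balance":
            return self.state["balance"]
        if name == "withdraw":
            self.state["balance"] -= kwargs["amount"]  # stateful side effect
            return self.state["balance"]
        raise ValueError(f"unknown tool: {name}")
```

Two sandboxes built with the same seed yield identical traces of results and failures, which is what lets a benchmark replay the same "unpredictable" environment for every model.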

Key facts

  • ComplexMCP is a benchmark for LLM agents in dynamic, interdependent tool sandboxes.
  • It provides over 300 tools from 7 stateful sandboxes.
  • Tools are atomic, interdependent, and prone to environmental noise.
  • The benchmark uses a seed-driven architecture to simulate dynamic states and API failures.
  • Top-tier LLMs fail to exceed a 60% success rate.
  • Human success rate is 90%.
  • Evaluation covers full-context and RAG paradigms.
  • Built on the Model Context Protocol (MCP).
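The "atomic, interdependent" property above can be made concrete with a hypothetical sketch: each tool does one small thing, and later tools only work with identifiers produced by earlier calls, so the agent must chain them in the right order. The tool names and data here are illustrative, not taken from ComplexMCP.

```python
# Tiny in-memory "file system" standing in for a stateful sandbox.
FILES = {"f1": {"name": "q1_report.xlsx", "text": "Revenue up 12%"}}

def search_files(query: str) -> list[str]:
    """Atomic: returns only file IDs, never contents."""
    return [fid for fid, f in FILES.items() if query in f["name"]]

def read_file(file_id: str) -> str:
    """Atomic and interdependent: needs an ID from search_files."""
    return FILES[file_id]["text"]

def send_report(body: str) -> str:
    """Atomic: sends whatever text it is given."""
    return f"sent: {body}"

# The agent must compose the tools; no single tool solves the task.
ids = search_files("report")
result = send_report(read_file(ids[0]))
```

Because no single tool exposes the end result, a wrong ordering (e.g. calling `read_file` before `search_files`) simply fails, which is one way interdependence makes multi-step tasks hard for agents.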
