Open Agent Leaderboard Launches to Benchmark Full AI Agent Systems

ai-technology · 2026-05-18

IBM Research has launched the Open Agent Leaderboard, an open benchmark designed to evaluate complete AI agent systems rather than just the underlying models. The leaderboard assesses agents across six benchmarks (SWE-Bench Verified, BrowseComp+, AppWorld, tau2-Bench Airline, tau2-Bench Retail, and tau2-Bench Telecom) covering coding, research, personal assistance, customer service, and technical support. It reports both quality and cost per task, so agents can be compared on practical deployability as well as accuracy. The initiative is paired with the Exgentic framework for reproducing evaluations and a paper detailing the methodology. Early findings show that general-purpose agents can match specialized ones, and that agent architecture (e.g., tool shortlisting) has a significant effect on performance. The leaderboard is open to community contributions of new agents, benchmarks, and models.
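Reporting quality alongside cost per task means no single number decides which agent is "best": a cheaper agent with slightly lower quality may still be the better choice. One common way to read such two-axis results is a Pareto frontier, sketched below in Python. The `Result` fields, the agent names, and all the numbers are hypothetical and made up for illustration; they are not leaderboard data, and the source does not say the leaderboard computes a frontier this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    agent: str            # hypothetical agent name
    quality: float        # fraction of tasks solved, between 0 and 1
    cost_per_task: float  # average cost per task, in USD


def pareto_frontier(results):
    """Return the results that no other result beats on both axes.

    A result is dominated if another one has quality at least as high and
    cost at least as low, and is strictly better on one of the two.
    """
    frontier = []
    for r in results:
        dominated = any(
            o.quality >= r.quality
            and o.cost_per_task <= r.cost_per_task
            and (o.quality > r.quality or o.cost_per_task < r.cost_per_task)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Cheapest first, so the list reads as "what more money buys".
    return sorted(frontier, key=lambda r: r.cost_per_task)


# Made-up numbers for illustration only, not leaderboard results.
results = [
    Result("agent-a", 0.62, 1.40),
    Result("agent-b", 0.58, 0.35),
    Result("agent-c", 0.55, 0.90),
    Result("agent-d", 0.70, 2.10),
]

frontier = pareto_frontier(results)
print([r.agent for r in frontier])  # ['agent-b', 'agent-a', 'agent-d']
```

In this toy data, agent-c drops out because agent-b is both cheaper and more accurate; the remaining three each represent a different quality-for-cost trade-off.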

Key facts

Open Agent Leaderboard launched by IBM Research.
Evaluates full agent systems, not just models.
Six benchmarks: SWE-Bench Verified, BrowseComp+, AppWorld, tau2-Bench Airline, tau2-Bench Retail, tau2-Bench Telecom.
Reports quality and cost per task.
Paired with Exgentic framework and a methodology paper.
General-purpose agents competitive with specialized ones.
Tool shortlisting improves performance across models.
Open for community submissions via pull request on the results dataset.
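The article names tool shortlisting as an architectural choice but does not describe how it is implemented. As a rough illustration of the general idea, the Python sketch below assumes shortlisting means passing the model only the few tools most relevant to the current task instead of the full catalog. The word-overlap scoring, the `shortlist_tools` function, and the tool names are all hypothetical; real agents may rank tools with embeddings or a model call instead.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(task, tools, k=2):
    """Return the names of the k tools whose descriptions share the most
    words with the task. Ties are broken alphabetically by tool name.

    This is a deliberately naive relevance score, used here only to show
    the shape of the technique.
    """
    task_words = tokenize(task)
    ranked = sorted(
        tools.items(),
        key=lambda item: (-len(task_words & tokenize(item[1])), item[0]),
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical tool catalog: name -> natural-language description.
tools = {
    "search_flights": "search for a flight between two airports",
    "cancel_booking": "cancel an existing flight booking",
    "send_email": "send an email message to a contact",
    "play_music": "play a song or playlist",
}

shortlist = shortlist_tools("cancel my flight booking for tomorrow", tools)
print(shortlist)  # ['cancel_booking', 'search_flights']
```

Only the shortlisted tools would then be described to the model, which keeps the prompt shorter and leaves fewer irrelevant options to choose from.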
