ARTFEED — Contemporary Art Intelligence

RobustBench-TC: Benchmarking Sim-to-Real Gap in Tool-Use Language Agents

ai-technology · 2026-05-13

A new benchmark, RobustBench-TC, exposes the sim-to-real gap in tool-use language agents by introducing 22 perturbation types grounded in real-world failures drawn from GitHub issues. The perturbations are organized by four components of the tool-use POMDP: observation, action space, reward-relevant metadata, and transition dynamics. Testing 21 models ranging from 1.5B to 32B parameters, including the closed-source o4-mini, reveals sharply uneven robustness: observation perturbations reduce accuracy less than the other perturbation types do. The work highlights that current benchmarks assume clean inputs, unambiguous tool registries, and reliable APIs, whereas real deployments suffer from user typos, misconfigured timeouts, and duplicate tool names.
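To make the taxonomy concrete, here is a minimal, hypothetical sketch of three of the failure modes the article names, each mapped to a POMDP component. The function names, registry shape, and config keys are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical perturbations organized by tool-use POMDP component, inspired by
# the failure modes listed in the article (user typos, duplicate tool names,
# misconfigured timeouts). All names and structures are illustrative only.
import random


def perturb_observation(user_message: str, rng: random.Random) -> str:
    """Observation perturbation: inject a character-swap typo into user input."""
    if len(user_message) < 2:
        return user_message
    i = rng.randrange(len(user_message) - 1)
    chars = list(user_message)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def perturb_action_space(tool_registry: list) -> list:
    """Action-space perturbation: register a duplicate tool name."""
    if not tool_registry:
        return tool_registry
    return tool_registry + [dict(tool_registry[0])]


def perturb_transition(api_config: dict) -> dict:
    """Transition-dynamics perturbation: misconfigure the API timeout."""
    perturbed = dict(api_config)
    perturbed["timeout_s"] = 0.001  # effectively guarantees request timeouts
    return perturbed
```

An evaluation harness in this style would apply one perturbation at a time to an otherwise clean task, then measure the drop in tool-calling accuracy relative to the unperturbed baseline.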

Key facts

  • RobustBench-TC includes 22 perturbation types
  • Perturbations are grounded in verified GitHub issues or documented tool-calling failures
  • 21 models tested from 1.5B to 32B parameters
  • Closed-source o4-mini included in evaluation
  • Observation perturbations reduce accuracy less than other types
  • Perturbations organized by four POMDP components
  • Real deployments face user typos, misconfigured timeouts, duplicate tool names
  • Study published on arXiv with ID 2605.11928

Entities

Institutions

  • arXiv
