ARTFEED — Contemporary Art Intelligence

RobustBench-TC: Benchmarking Sim-to-Real Gap in Tool-Use Language Agents

ai-technology · 2026-05-13

A new benchmark, RobustBench-TC, exposes the sim-to-real gap in tool-use language agents by introducing 22 perturbation types grounded in real-world failures drawn from GitHub issues. The perturbations are organized by four components of the tool-use POMDP: observation, action space, reward-relevant metadata, and transition dynamics. Testing 21 models ranging from 1.5B to 32B parameters, including the closed-source o4-mini, reveals sharply uneven robustness: observation perturbations reduce accuracy less than the other perturbation types do. The work highlights that current benchmarks assume clean inputs, unambiguous tool registries, and reliable APIs, whereas real deployments suffer from user typos, misconfigured timeouts, and duplicate tool names.
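To make the taxonomy concrete, here is a minimal, hypothetical sketch of three of the failure modes the article names, each mapped to a POMDP component. The function names, registry shape, and config keys are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical perturbations organized by tool-use POMDP component, inspired by
# the failure modes listed in the article (user typos, duplicate tool names,
# misconfigured timeouts). All names and structures are illustrative only.
import random


def perturb_observation(user_message: str, rng: random.Random) -> str:
    """Observation perturbation: inject a character-swap typo into user input."""
    if len(user_message) < 2:
        return user_message
    i = rng.randrange(len(user_message) - 1)
    chars = list(user_message)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def perturb_action_space(tool_registry: list) -> list:
    """Action-space perturbation: register a duplicate tool name."""
    if not tool_registry:
        return tool_registry
    return tool_registry + [dict(tool_registry[0])]


def perturb_transition(api_config: dict) -> dict:
    """Transition-dynamics perturbation: misconfigure the API timeout."""
    perturbed = dict(api_config)
    perturbed["timeout_s"] = 0.001  # effectively guarantees request timeouts
    return perturbed
```

An evaluation harness in this style would apply one perturbation at a time to an otherwise clean task, then measure the drop in tool-calling accuracy relative to the unperturbed baseline.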

Key facts

  • RobustBench-TC includes 22 perturbation types
  • Perturbations are grounded in verified GitHub issues or documented tool-calling failures
  • 21 models tested from 1.5B to 32B parameters
  • Closed-source o4-mini included in evaluation
  • Observation perturbations reduce accuracy less than other types
  • Perturbations organized by four POMDP components
  • Real deployments face user typos, misconfigured timeouts, duplicate tool names
  • Study published on arXiv with ID 2605.11928

Entities

Institutions

  • arXiv
