Executable Benchmarking Suite for Tool-Using Agents

other · 2026-05-13

A recently introduced executable benchmarking suite for closed-loop tool-using agents makes workloads, drivers, and evidence explicit under a single admission agreement. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through shared workload adapters, task manifests, event schemas, and reporting pipelines. It separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows, and retains non-admitted artifacts for auditing and onboarding. Admitted evidence records document latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, and provenance.

Key facts

Suite connects WebArena Verified, a SWE-Gym slice, and MiniWoB++
Uses common workload adapters, task manifests, event schemas, reporting pipelines
Separates paper-facing evidence from preflight, fixture, smoke, diagnostic rows
Preserves non-admitted artifacts for audit and onboarding
Records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, provenance
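
The separation between admitted evidence and retained non-admitted rows can be illustrated with a minimal sketch. Everything named here (`RowKind`, `EvidenceRow`, `split_rows`, the field names, and the sample values) is a hypothetical illustration rather than the suite's actual schema, which the summary does not specify.

```python
from dataclasses import dataclass
from enum import Enum


class RowKind(Enum):
    # Hypothetical labels mirroring the row categories named in the summary.
    ADMITTED = "admitted"
    PREFLIGHT = "preflight"
    FIXTURE = "fixture"
    SMOKE = "smoke"
    DIAGNOSTIC = "diagnostic"


@dataclass(frozen=True)
class EvidenceRow:
    # Assumed fields; the real records also carry verifier metadata,
    # replay bindings, and patch-generation cost.
    workload: str
    task_id: str
    kind: RowKind
    latency_s: float
    invalid_actions: int
    provenance: str


def split_rows(rows):
    """Partition rows into paper-facing evidence and a retained audit set."""
    admitted = [r for r in rows if r.kind is RowKind.ADMITTED]
    retained = [r for r in rows if r.kind is not RowKind.ADMITTED]
    return admitted, retained


# Made-up sample rows, for illustration only.
rows = [
    EvidenceRow("miniwob++", "task-1", RowKind.ADMITTED, 1.2, 0, "run-a"),
    EvidenceRow("miniwob++", "task-1", RowKind.SMOKE, 0.4, 1, "run-b"),
    EvidenceRow("swe-gym", "task-2", RowKind.DIAGNOSTIC, 9.8, 0, "run-c"),
]
admitted, retained = split_rows(rows)
```

Non-admitted rows are returned rather than dropped, which matches the summary's point that these artifacts stay available for audit and onboarding.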

Entities

—

Sources

arXiv cs.AI — 2026-05-13