Executable Benchmarking Suite for Tool-Using Agents
A recently developed executable benchmarking suite for closed-loop tool-using agents makes workloads, drivers, and evidence explicit under a unified admission agreement. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through shared workload adapters, task manifests, event schemas, and reporting pipelines. It separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows, and it retains non-admitted artifacts for audit and onboarding. Admitted evidence records document latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, and provenance.
Key facts
- Suite connects WebArena Verified, a SWE-Gym slice, and MiniWoB++
- Uses common workload adapters, task manifests, event schemas, and reporting pipelines
- Separates paper-facing evidence from preflight, fixture, smoke, diagnostic rows
- Preserves non-admitted artifacts for audit and onboarding
- Records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, provenance
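The split between admitted and retained rows can be pictured with a small sketch. Everything here is hypothetical and written only for illustration: the names `RowKind`, `EvidenceRow`, `is_admitted`, and `partition`, the field names, and the sample values are not taken from the suite. The rule that an admitted row must carry a replay binding and provenance is likewise an assumption, inferred from the list of things admitted records document.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class RowKind(Enum):
    # Row classes named in the summary; this encoding is hypothetical.
    PAPER_FACING = "paper_facing"
    PREFLIGHT = "preflight"
    FIXTURE = "fixture"
    SMOKE = "smoke"
    DIAGNOSTIC = "diagnostic"


@dataclass
class EvidenceRow:
    # Hypothetical record layout mirroring the fields listed in the summary.
    workload: str
    task_id: str
    kind: RowKind
    latency_ms: float
    invalid_actions: int
    patch_generation_cost: Optional[float] = None  # only meaningful for patch workloads
    verifier_metadata: dict = field(default_factory=dict)
    replay_binding: Optional[str] = None
    provenance: Optional[str] = None


def is_admitted(row: EvidenceRow) -> bool:
    # Assumed rule: only paper-facing rows with a replay binding and
    # provenance count as admitted evidence.
    return (
        row.kind is RowKind.PAPER_FACING
        and row.replay_binding is not None
        and row.provenance is not None
    )


def partition(rows):
    # Non-admitted rows are kept, not dropped, so they stay available for audit.
    admitted, retained = [], []
    for row in rows:
        (admitted if is_admitted(row) else retained).append(row)
    return admitted, retained


# Made-up sample rows, for illustration only.
rows = [
    EvidenceRow("webarena_verified", "t1", RowKind.PAPER_FACING, 1200.0, 0,
                replay_binding="replay-001", provenance="run-a"),
    EvidenceRow("miniwob++", "t2", RowKind.SMOKE, 300.0, 2),
    EvidenceRow("swe_gym_slice", "t3", RowKind.PAPER_FACING, 5400.0, 1,
                patch_generation_cost=0.42),  # no replay binding, so not admitted
]

admitted, retained = partition(rows)
print(len(admitted), len(retained))
```

In this sketch the third row is paper-facing but lacks a replay binding, so it lands in `retained` alongside the smoke row. A stricter or looser admission rule would change only `is_admitted`.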
Entities
—