WebGameBench: Evaluating Coding Agents via Browser Games
WebGameBench is a benchmark that assesses how well coding agents turn a structured specification into a game playable in a browser. It evaluates the delivered application rather than source code or intermediate traces. Each generated game is built, served, and exposed under a unified deployment protocol, then evaluated at runtime in a real browser. A runtime evaluator assigns one of three labels: EXCELLENT, USABLE, or UNUSABLE. Browser-native games serve as a compact yet behavior-rich testbed because a working game must coordinate input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. Human reviewers validate a subset of the results.
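As a rough illustration of the three-label outcome, the sketch below maps a summary of one browser session to a label. Only the label names come from the benchmark description. The `RuntimeObservation` fields and the decision rule in `classify` are assumptions made for this example and are not the benchmark's actual criteria.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    EXCELLENT = "EXCELLENT"
    USABLE = "USABLE"
    UNUSABLE = "UNUSABLE"


@dataclass(frozen=True)
class RuntimeObservation:
    """Hypothetical summary of one browser session against a served game."""
    loads: bool               # page was served and rendered without a fatal error
    core_loop_playable: bool  # input led to rule-consistent state changes
    defects: int              # number of non-blocking issues observed


def classify(obs: RuntimeObservation) -> Label:
    # Assumed rule, for illustration only: a game that does not load or
    # cannot be played is UNUSABLE; a playable game is EXCELLENT when no
    # defects were observed and USABLE otherwise.
    if not (obs.loads and obs.core_loop_playable):
        return Label.UNUSABLE
    return Label.EXCELLENT if obs.defects == 0 else Label.USABLE
```

For example, `classify(RuntimeObservation(loads=True, core_loop_playable=True, defects=2))` returns `Label.USABLE` under this assumed rule.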
Key facts
- WebGameBench evaluates coding agents on requirement-to-application tasks.
- It uses browser-native games as testbeds.
- Generated artifacts are built, served, and exposed under a unified deployment protocol.
- A runtime evaluator assigns labels: EXCELLENT, USABLE, or UNUSABLE.
- Human reviewers validate the runtime labels on a subset of the results.
- Games require input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback.
- The benchmark focuses on delivered applications, not source code or intermediate traces.
- WebGameBench is introduced in arXiv paper 2605.17637.
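To make the behaviors listed above concrete, here is a minimal sketch of a game core that touches each of them: key input, grid coordinates, a movement rule, a status transition, a win condition, a restart, and a rendered status line. The game itself, the names `Game`, `GRID`, `GOAL`, and `MOVES`, and the key strings are all hypothetical and are not taken from the benchmark's task set.

```python
from dataclasses import dataclass

GRID = 3                      # hypothetical board size (3 x 3 cells)
GOAL = (GRID - 1, GRID - 1)   # reaching this cell ends the game

# Input handling: map key names to grid deltas (spatial mapping).
MOVES = {
    "ArrowUp": (0, -1),
    "ArrowDown": (0, 1),
    "ArrowLeft": (-1, 0),
    "ArrowRight": (1, 0),
}


@dataclass
class Game:
    pos: tuple = (0, 0)
    status: str = "playing"   # "playing" or "won"
    moves: int = 0

    def handle_key(self, key: str) -> None:
        # Ignore unknown keys and any input after the terminal state.
        if self.status != "playing" or key not in MOVES:
            return
        dx, dy = MOVES[key]
        # Rule execution: clamp the move so the player stays on the board.
        x = min(max(self.pos[0] + dx, 0), GRID - 1)
        y = min(max(self.pos[1] + dy, 0), GRID - 1)
        if (x, y) != self.pos:
            self.pos = (x, y)
            self.moves += 1
        # Terminal condition and state transition.
        if self.pos == GOAL:
            self.status = "won"

    def restart(self) -> None:
        # Restart behavior: return to the initial state.
        self.pos, self.status, self.moves = (0, 0), "playing", 0

    def render(self) -> str:
        # Visible feedback: a one-line status a UI could display.
        return f"{self.status} | pos={self.pos} | moves={self.moves}"
```

A short session: pressing `ArrowRight` twice and `ArrowDown` twice moves the player from `(0, 0)` to the goal and sets `status` to `"won"`; further key presses are ignored until `restart()` is called.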
Entities
Institutions
- arXiv