WebGameBench: Evaluating Coding Agents via Browser Games

other · 2026-05-20

WebGameBench is a benchmark that assesses whether coding agents can turn a structured specification into a browser-accessible game. Rather than inspecting source code or intermediate traces, it evaluates the delivered application: each generated artifact is built, served, and exposed under a unified deployment protocol, then evaluated at runtime in a real browser and labeled EXCELLENT, USABLE, or UNUSABLE. Browser-native games serve as a compact but behavior-rich testbed, because a working game must coordinate input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. A human-reviewed subset is used to confirm the runtime labels.
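
To make the three-way labeling concrete, here is a minimal Python sketch of how a runtime evaluator could map observed behaviors to a label. It is illustrative only: the summary does not state the benchmark's actual criteria, so the `Observation` type, the `CORE` set, and the thresholds in `assign_label` are all assumptions. Only the three label names and the seven behaviors come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    EXCELLENT = "EXCELLENT"
    USABLE = "USABLE"
    UNUSABLE = "UNUSABLE"


# The seven behaviors the benchmark description says a game must coordinate.
BEHAVIORS = frozenset({
    "input_handling",
    "spatial_mapping",
    "rule_execution",
    "state_transitions",
    "terminal_conditions",
    "restart_behavior",
    "visible_feedback",
})

# ASSUMPTION: behaviors treated as the minimum for a game to be playable.
# This split is hypothetical and not taken from the benchmark.
CORE = frozenset({"input_handling", "rule_execution", "visible_feedback"})


@dataclass(frozen=True)
class Observation:
    """Hypothetical record of one browser evaluation run."""
    loaded: bool          # did the served page load in the browser?
    passed: frozenset     # names of behaviors observed working


def assign_label(obs: Observation) -> Label:
    """Map an observation to a label under the assumed rule set."""
    unknown = obs.passed - BEHAVIORS
    if unknown:
        raise ValueError(f"unknown behaviors: {sorted(unknown)}")
    # ASSUMPTION: a game that fails to load or lacks a core behavior is unusable.
    if not obs.loaded or not CORE <= obs.passed:
        return Label.UNUSABLE
    # ASSUMPTION: EXCELLENT requires every listed behavior to work.
    if obs.passed == BEHAVIORS:
        return Label.EXCELLENT
    return Label.USABLE
```

For example, under these assumed rules a game that loads and passes everything except `restart_behavior` would be labeled USABLE, while one that never loads would be UNUSABLE regardless of anything else.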

Key facts

WebGameBench evaluates coding agents on requirement-to-application tasks.
It uses browser-native games as testbeds.
Generated artifacts are built, served, and exposed under a unified deployment protocol.
A runtime evaluator assigns labels: EXCELLENT, USABLE, or UNUSABLE.
A human-reviewed subset confirms runtime labels.
Games require input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback.
The benchmark focuses on delivered applications, not source code or intermediate traces.
WebGameBench is introduced in arXiv paper 2605.17637.
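
One way to check runtime labels against a human-reviewed subset is to compute their agreement. The Python sketch below reports raw agreement and Cohen's kappa for two label sequences. The summary does not say which statistic the benchmark uses, so this choice of metric and the function name `agreement` are assumptions, and any inputs passed to it are for illustration rather than benchmark results.

```python
from collections import Counter

LABELS = ("EXCELLENT", "USABLE", "UNUSABLE")


def agreement(runtime, human):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists.

    ASSUMPTION: this metric is chosen for illustration; the benchmark's
    own validation procedure is not described in the summary.
    """
    if len(runtime) != len(human) or not runtime:
        raise ValueError("need two non-empty lists of equal length")
    for label in list(runtime) + list(human):
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")

    n = len(runtime)
    # Fraction of games where the evaluator and the reviewer agree.
    observed = sum(r == h for r, h in zip(runtime, human)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    runtime_counts = Counter(runtime)
    human_counts = Counter(human)
    expected = sum(runtime_counts[l] * human_counts[l] for l in LABELS) / (n * n)

    # If both raters always use the same single label, kappa is undefined;
    # treat that case as perfect agreement.
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return observed, kappa
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which matters when one label dominates the subset.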
