ARTFEED — Contemporary Art Intelligence

WebGameBench: Evaluating Coding Agents via Browser Games

other · 2026-05-20

WebGameBench is a benchmark that assesses whether coding agents can turn a structured specification into a game that runs in the browser. Rather than inspecting source code or intermediate outputs, it evaluates the delivered application. Each generated game is built, served, and exposed under a unified deployment protocol, then evaluated in a real browser, which yields one of three labels: EXCELLENT, USABLE, or UNUSABLE. Browser-native games serve as a compact yet behavior-rich testbed because a working game requires coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. Human reviewers validate a subset of the results.
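
To make the three-label outcome concrete, here is a minimal Python sketch of how a runtime report might be collapsed into EXCELLENT, USABLE, or UNUSABLE. The `RuntimeReport` structure, the `assign_label` function, and the `usable_threshold` value are all hypothetical and chosen for illustration; the summary does not describe the paper's actual scoring rubric, so this should not be read as the benchmark's method.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    EXCELLENT = "EXCELLENT"
    USABLE = "USABLE"
    UNUSABLE = "UNUSABLE"


# The seven behaviors named in the summary. The identifiers are our own.
BEHAVIORS = (
    "input_handling",
    "spatial_mapping",
    "rule_execution",
    "state_transitions",
    "terminal_conditions",
    "restart_behavior",
    "visible_feedback",
)


@dataclass
class RuntimeReport:
    """Hypothetical result of exercising one served game in a browser."""
    loaded: bool               # did the page build, serve, and load at all?
    passed: dict[str, bool]    # behavior name -> whether the check passed


def assign_label(report: RuntimeReport, usable_threshold: int = 5) -> Label:
    """Assumed aggregation rule, not the paper's: all checks pass means
    EXCELLENT, at least `usable_threshold` pass means USABLE, and anything
    else (including a game that never loads) is UNUSABLE."""
    if not report.loaded:
        return Label.UNUSABLE
    n_passed = sum(report.passed.get(b, False) for b in BEHAVIORS)
    if n_passed == len(BEHAVIORS):
        return Label.EXCELLENT
    if n_passed >= usable_threshold:
        return Label.USABLE
    return Label.UNUSABLE
```

Under this assumed rule, a game that loads and passes every check except restart behavior would be labeled USABLE, while a game that fails to load is UNUSABLE regardless of anything else.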

Key facts

  • WebGameBench evaluates coding agents on requirement-to-application tasks.
  • It uses browser-native games as testbeds.
  • Generated artifacts are built, served, and exposed under a unified deployment protocol.
  • A runtime evaluator assigns labels: EXCELLENT, USABLE, or UNUSABLE.
  • A human-reviewed subset confirms runtime labels.
  • Games require input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback.
  • The benchmark focuses on delivered applications, not source code or intermediate traces.
  • WebGameBench is introduced in arXiv paper 2605.17637.
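
The behaviors listed in the key facts are easier to picture with a toy example. The following Python sketch models a tiny grid game whose logic touches each of them: key-to-move input handling, grid-bounded spatial mapping, a movement rule, a state transition, a terminal condition, restart, and a text rendering as visible feedback. It is an illustration only; `GridGame`, `KEY_TO_DELTA`, and the game itself are hypothetical and are not taken from the benchmark's task set.

```python
from dataclasses import dataclass

# Input handling: map browser-style key names to grid deltas (dx, dy).
KEY_TO_DELTA = {
    "ArrowUp": (0, -1),
    "ArrowDown": (0, 1),
    "ArrowLeft": (-1, 0),
    "ArrowRight": (1, 0),
}


@dataclass
class GridGame:
    size: int = 3
    goal: tuple[int, int] = (2, 2)
    pos: tuple[int, int] = (0, 0)
    status: str = "playing"  # state: "playing" or "won"

    def press(self, key: str) -> None:
        # Rule execution: ignore unknown keys and any input after the game ends.
        if self.status != "playing" or key not in KEY_TO_DELTA:
            return
        dx, dy = KEY_TO_DELTA[key]
        x, y = self.pos[0] + dx, self.pos[1] + dy
        # Spatial mapping: moves that would leave the grid are rejected.
        if 0 <= x < self.size and 0 <= y < self.size:
            self.pos = (x, y)
        # Terminal condition and state transition.
        if self.pos == self.goal:
            self.status = "won"

    def restart(self) -> None:
        # Restart behavior: return to the initial state.
        self.pos = (0, 0)
        self.status = "playing"

    def render(self) -> str:
        # Visible feedback: P = player, G = goal, . = empty cell.
        rows = []
        for y in range(self.size):
            rows.append("".join(
                "P" if (x, y) == self.pos else "G" if (x, y) == self.goal else "."
                for x in range(self.size)
            ))
        return "\n".join(rows)
```

Even in a game this small, a defect in any one piece (for example, accepting moves after the win state, or failing to reset on restart) changes what a player sees, which is the kind of failure that shows up in the running application and may not be evident from the source alone.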

Entities

Institutions

  • arXiv
