ARTFEED — Contemporary Art Intelligence

Vibe Code Bench: New Benchmark Tests AI on Full Web App Development

ai-technology · 2026-05-07

Researchers have launched Vibe Code Bench, a new benchmark for assessing AI models on end-to-end web application development. Unlike traditional benchmarks that focus on isolated coding tasks, it evaluates the entire process of building a working application from scratch. The benchmark comprises 100 web application specifications (50 for public validation, 50 held out for testing), covering 964 browser-based workflows with 10,131 substeps. An autonomous browser agent runs the evaluations against deployed applications. Across 16 frontier models, the best test-set accuracy is 61.8%, underscoring how difficult reliable end-to-end application development remains. The study finds that self-testing during generation is a strong predictor of performance (Pearson r = 0.72). A human alignment study further shows that the choice of evaluator substantially affects results, with step-level agreement ranging from 31.8% to 93.6%. Together, the dataset and its evaluation approach mark a notable step for AI code-generation research.
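The paper's exact evaluation schema is not described here, but the idea of a browser agent checking substeps within workflows and rolling the verdicts up into an accuracy score can be sketched roughly as follows. The `Substep`/`Workflow` names and the example data are illustrative assumptions, not the benchmark's actual format:

```python
from dataclasses import dataclass

@dataclass
class Substep:
    description: str   # what the browser agent should observe
    passed: bool       # the agent's verdict for this substep

@dataclass
class Workflow:
    name: str
    substeps: list

def substep_accuracy(workflows):
    """Fraction of all substeps, across all workflows, judged as passing."""
    total = sum(len(w.substeps) for w in workflows)
    passed = sum(s.passed for w in workflows for s in w.substeps)
    return passed / total if total else 0.0

# Hypothetical example: one workflow where only the first substep passes.
signup = Workflow("create account", [
    Substep("submit signup form", True),
    Substep("see confirmation page", False),
])
print(substep_accuracy([signup]))  # 0.5
```

Aggregating at the substep level rather than pass/fail per application gives partial credit, which is one plausible way a benchmark with 10,131 substeps could produce fine-grained accuracy figures like 61.8%.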

Key facts

  • Vibe Code Bench is a benchmark for end-to-end web application development.
  • It includes 100 web application specifications (50 public, 50 held-out).
  • There are 964 browser-based workflows with 10,131 substeps.
  • Evaluations use an autonomous browser agent against deployed applications.
  • Best model accuracy is 61.8% on the test split across 16 frontier models.
  • Self-testing during generation is a strong performance predictor (Pearson r=0.72).
  • Human alignment study shows evaluator selection affects outcomes (31.8-93.6% agreement).
  • The benchmark is from arXiv paper 2603.04601.
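The reported r = 0.72 is a standard Pearson correlation between a per-model self-testing measure and benchmark accuracy. As a minimal sketch, with entirely made-up data points (only the formula is standard; the paper's actual variables are not reproduced here):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model (self-testing rate, benchmark accuracy) pairs.
self_test_rate = [0.10, 0.35, 0.50, 0.80]
accuracy       = [0.20, 0.40, 0.45, 0.60]
print(pearson_r(self_test_rate, accuracy))
```

A value near 0.72 across 16 models would indicate a strong, though not deterministic, relationship between a model testing its own output during generation and its final score.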

Entities

Institutions

  • arXiv

Sources