Claude Leads LLM Web Generation in 8-Week Study
An article recently published on arXiv (2605.06707) reports an eight-week observational study comparing 68 single-file HTML outputs from four families of reasoning models: GPT, Gemini, Grok, and Claude. The outputs were gathered during 17 public experiments in the "HTML AI Battle" initiative between December 10, 2025, and February 4, 2026. Each output was assessed from rendered browser video, using both human evaluations and a Gemini LLM-as-a-judge layer to score prompt adherence, functional accuracy, and UI quality, all under a standardized public-interface protocol with no custom instructions or personality adjustments. Results were formatted for social media platforms, including X (Twitter), TikTok, and YouTube. Claude emerged as the top-performing family, achieving the highest mean scores and winning the most head-to-head comparisons.
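The combined human-plus-judge scoring described above can be sketched as follows. The three criteria come from the article, but the 1–10 scale, the field names, and the equal weighting of human and judge scores are illustrative assumptions, not details reported in the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Evaluation:
    """One rendered output scored on the study's three criteria (assumed 1-10 scale)."""
    prompt_adherence: float
    functional_accuracy: float
    ui_quality: float

    def overall(self) -> float:
        # Unweighted mean of the three criteria (an assumption for illustration).
        return mean([self.prompt_adherence, self.functional_accuracy, self.ui_quality])


def combined_score(human: Evaluation, judge: Evaluation) -> float:
    """Blend a human evaluation with an LLM-as-a-judge evaluation.

    Equal weighting of the two layers is assumed; the paper does not
    specify how the scores were aggregated.
    """
    return mean([human.overall(), judge.overall()])


human = Evaluation(prompt_adherence=8, functional_accuracy=7, ui_quality=9)
judge = Evaluation(prompt_adherence=7, functional_accuracy=8, ui_quality=8)
print(round(combined_score(human, judge), 2))  # → 7.83
```

A real pipeline would run this per output and then rank model families by mean combined score across all 68 generations.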
Key facts
- 68 single-file HTML generations were compared across 17 public experiments.
- Experiments ran from December 10, 2025 to February 4, 2026.
- Four model families tested: GPT, Gemini, Grok, and Claude.
- No custom instructions, personality tuning, or repair prompts were used.
- Evaluation used human scores and a Gemini LLM-as-a-judge layer.
- Outputs were shared on X, TikTok, and YouTube.
- Two predictive models were built: one for X impressions and one for HTML verbosity.
- Claude was the strongest and most consistent family.
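The predictive model for X impressions mentioned above could take a shape like this ordinary least-squares sketch. The single feature (overall quality score), the toy data, and the closed-form fit are all assumptions for illustration; the article does not specify the modeling approach or its inputs.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Closed-form ordinary least squares for y = a + b*x with one feature."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b


# Hypothetical data: overall quality score vs. X impressions per post.
scores = [6.0, 7.0, 8.0, 9.0]
impressions = [1200.0, 1500.0, 1800.0, 2100.0]

a, b = fit_line(scores, impressions)
predicted = a + b * 7.5  # expected impressions for a score of 7.5
print(a, b, predicted)  # → -600.0 300.0 1650.0
```

The verbosity model would follow the same pattern with HTML length (e.g., character count) as the target variable.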
Entities
Repositories
- arXiv
Model families
- GPT
- Gemini
- Grok
- Claude
Platforms
- X
- TikTok
- YouTube