DeGenTWeb Reveals Widespread LLM-Generated Content on Websites

ai-technology · 2026-05-04

A new research paper, DeGenTWeb, presents a systematic method for identifying websites dominated by content that large language models (LLMs) generated with minimal human input. The authors argue that previous claims about the prevalence of LLM content relied on non-representative samples and opaque methodologies, and that LLM text detectors perform worse than advertised once they are tuned to minimize false positives, that is, human-written text wrongly attributed to an LLM. DeGenTWeb adapts these detectors to web pages and aggregates their results across multiple pages of a site to categorize the site as a whole accurately. The study finds that LLM-dominant sites are highly prevalent, though the abstract does not give specific numbers. The paper is available on arXiv under identifier 2605.00087.
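To make the idea of page-to-site aggregation concrete, here is a minimal Python sketch. It is not the paper's method: the abstract does not say how DeGenTWeb combines page results, so the majority-fraction rule, the function name classify_site, and the values of page_threshold, site_fraction and min_pages are all assumptions chosen for illustration. The per-page scores are assumed to come from some LLM text detector that returns a value between 0 and 1.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SiteVerdict:
    pages_scored: int        # how many pages contributed a score
    flagged_fraction: float  # share of pages whose score met the page threshold
    label: str               # "llm-dominant", "not-llm-dominant" or "insufficient-data"


def classify_site(
    page_scores: Sequence[float],
    page_threshold: float = 0.9,  # assumed cutoff for calling one page LLM-generated
    site_fraction: float = 0.5,   # assumed share of flagged pages that makes a site "dominant"
    min_pages: int = 5,           # assumed minimum sample size before labeling a site
) -> SiteVerdict:
    """Aggregate per-page detector scores into a single site-level label.

    Hypothetical rule: a site is LLM-dominant when at least `site_fraction`
    of its sampled pages score at or above `page_threshold`.
    """
    n = len(page_scores)
    flagged = sum(1 for score in page_scores if score >= page_threshold)
    fraction = flagged / n if n else 0.0

    # Refuse to label a site from too few pages: one noisy page should not decide.
    if n < min_pages:
        return SiteVerdict(n, fraction, "insufficient-data")

    label = "llm-dominant" if fraction >= site_fraction else "not-llm-dominant"
    return SiteVerdict(n, fraction, label)
```

The point of a rule like this is that a single page-level mistake has limited influence: a site is labeled only after several pages agree, which is one way site-level categorization can be more reliable than any individual page verdict.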

Key facts

DeGenTWeb systematically identifies LLM-dominant websites
LLM-dominant sites are those whose content is mostly generated by LLMs with little human input
Previous claims about LLM content prevalence lacked representative samples
LLM text detectors perform worse than advertised when minimizing false positives
DeGenTWeb adapts detectors for web pages and aggregates results across pages
LLM-dominant sites are found to be highly prevalent
Paper available on arXiv: 2605.00087
Methodology aims for accurate site-level categorization
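
The point about detectors and false positives can be illustrated with a short Python sketch. It assumes a detector that outputs a score between 0 and 1 and a labeled set of human-written and LLM-written texts; the function names threshold_for_fpr and recall_at are hypothetical, and this is a generic calibration procedure, not the evaluation used in the paper.

```python
import math
from typing import Sequence


def threshold_for_fpr(human_scores: Sequence[float], target_fpr: float = 0.01) -> float:
    """Pick a threshold so that at most `target_fpr` of human texts score strictly above it."""
    if not human_scores:
        raise ValueError("need at least one human-written sample")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")

    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to misattribute to an LLM.
    allowed = math.floor(target_fpr * len(ranked))
    # At most `allowed` scores are strictly greater than ranked[allowed].
    return ranked[allowed]


def recall_at(threshold: float, llm_scores: Sequence[float]) -> float:
    """Share of LLM-written texts flagged when only scores above `threshold` count."""
    if not llm_scores:
        raise ValueError("need at least one LLM-written sample")
    return sum(1 for score in llm_scores if score > threshold) / len(llm_scores)
```

Lowering target_fpr pushes the threshold up, so fewer LLM-written texts clear it and recall falls. A detector that looks strong at a lenient threshold can therefore look much weaker once human text must almost never be misattributed.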
