Canary Tokens for Identifying AI Web Scrapers

ai-technology · 2026-05-14

A new approach uses canary tokens to automatically identify web scrapers associated with large language models (LLMs). The technique, described in arXiv paper 2605.13706, addresses the limits of current identification methods, which depend on voluntary disclosure, one-off experiments, or crowd-sourced reports. By serving decoy content, website operators can identify the scrapers that access these tokens, which in turn supports more effective use of access control mechanisms such as the Robots Exclusion Protocol. The aim is to help website owners limit LLM-related scraping, which can threaten site stability and raise legal, privacy, or ethical concerns.
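To make the decoy idea concrete, here is a minimal Python sketch of how a site might mint canary URLs and record which clients request them. It is an illustration under stated assumptions, not the paper's implementation: the `CanaryRegistry` class, the `/canary/` path prefix, and the `ExampleBot/1.0` user agent are all hypothetical names chosen for this example.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class CanaryRegistry:
    """Hypothetical helper: tracks decoy URLs and the clients that fetch them."""

    prefix: str = "/canary/"  # assumed path prefix reserved for decoys
    tokens: set = field(default_factory=set)
    hits: dict = field(default_factory=dict)  # user agent -> set of tokens fetched

    def mint(self) -> str:
        """Create a unique decoy path that ordinary visitors have no reason to request."""
        token = secrets.token_urlsafe(16)
        self.tokens.add(token)
        return self.prefix + token

    def observe(self, path: str, user_agent: str) -> bool:
        """Record one request; return True only if it touched a known canary."""
        if not path.startswith(self.prefix):
            return False
        token = path[len(self.prefix):]
        if token not in self.tokens:
            return False
        self.hits.setdefault(user_agent, set()).add(token)
        return True

    def flagged_agents(self) -> list:
        """User agents that have requested at least one canary, sorted by name."""
        return sorted(self.hits)


registry = CanaryRegistry()
decoy_path = registry.mint()
registry.observe(decoy_path, "ExampleBot/1.0")  # a scraper follows the decoy
registry.observe("/index.html", "Mozilla/5.0")  # a normal page view is ignored
print(registry.flagged_agents())
```

In this sketch a hit on a decoy path is treated as the alert. A real deployment would presumably also need to decide how decoys are exposed to crawlers and how to handle spoofed user agent strings, neither of which this example addresses.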

Key facts

arXiv paper 2605.13706 proposes canary tokens for identifying LLM web scrapers.
Existing identification methods rely on voluntary disclosure, experiments, or crowd-sourced reports.
Canary tokens are pieces of decoy content that trigger alerts when accessed.
The technique aims to improve scraper access control mechanisms like the Robots Exclusion Protocol.
LLM web scraping can affect site stability and raise legal, privacy, or ethical concerns.
The method is designed to be reliable and scalable.
It allows automatic inference of LLM-related scrapers.
The paper is hosted on arXiv.
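
As an illustration of how identified scrapers could feed into the Robots Exclusion Protocol, the following Python sketch renders a robots.txt file that disallows a list of flagged crawlers. It is an assumption-laden example rather than anything taken from the paper: the `render_robots_txt` function and the bot names are hypothetical, and the entries are assumed to be the product tokens that crawlers match against `User-agent` lines, not full user agent strings.

```python
def render_robots_txt(flagged_agents, disallow="/"):
    """Build robots.txt text that blocks each flagged crawler (hypothetical helper)."""
    lines = []
    # Deduplicate and sort so the generated file is stable between runs.
    for agent in sorted(set(flagged_agents)):
        lines.append(f"User-agent: {agent}")
        lines.append(f"Disallow: {disallow}")
        lines.append("")  # a blank line separates groups
    # Every other crawler keeps full access.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


# "ExampleBot" and "AnotherBot" are made-up product tokens.
print(render_robots_txt(["ExampleBot", "AnotherBot"]))
```

Robots.txt rules are advisory, so a file like this only affects crawlers that choose to honor it.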
