ARTFEED — Contemporary Art Intelligence

Canary Tokens for Identifying AI Web Scrapers

ai-technology · 2026-05-14

A new approach uses canary tokens to automatically identify web scrapers associated with large language models (LLMs). The technique, described in an arXiv paper (2605.13706), addresses the shortcomings of current identification methods, which depend on voluntary disclosure, isolated experiments, or crowd-sourced reports. By serving decoy content, website operators can identify the scrapers that interact with these tokens and then apply access control tools such as the Robots Exclusion Protocol more effectively. The aim is to help website owners curb LLM-related scraping, which can threaten site stability and raise legal, privacy, or ethical concerns.
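The decoy idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `CanaryRegistry` class, the `canary-` token prefix, and the bot name `ExampleBot/1.0` are hypothetical, and the sketch assumes each visitor receives a unique token that can later be matched against any text in which it resurfaces.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class CanaryRegistry:
    """Hypothetical in-memory registry of decoy tokens (illustration only)."""

    # token -> user agent that was served the decoy content
    issued: dict = field(default_factory=dict)
    # user agents whose tokens have been seen again later
    flagged: set = field(default_factory=set)

    def issue(self, user_agent: str) -> str:
        """Create a unique token to embed in decoy content for this visitor."""
        token = "canary-" + secrets.token_hex(8)
        self.issued[token] = user_agent
        return token

    def observe(self, text: str) -> set:
        """Scan any text (a request path, a log line, other output) for
        known tokens and return the user agents they were served to."""
        found = {ua for token, ua in self.issued.items() if token in text}
        self.flagged |= found
        return found


registry = CanaryRegistry()
token = registry.issue("ExampleBot/1.0")

# Decoy markup that a human visitor would not normally follow.
decoy_html = f'<a href="/archive/{token}" style="display:none">archive</a>'

# Later, the token shows up in a request path: attribute it to the visitor.
hits = registry.observe(f"GET /archive/{token} HTTP/1.1")
```

Because every token is unique to the visitor that received it, a later sighting points back to one specific scraper rather than to scrapers in general.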

Key facts

  • arXiv paper 2605.13706 proposes canary tokens for identifying LLM web scrapers.
  • Existing identification methods rely on voluntary disclosure, experiments, or crowd-sourced reports.
  • Canary tokens are pieces of decoy content that trigger alerts when accessed.
  • The technique aims to improve scraper access control mechanisms like the Robots Exclusion Protocol.
  • LLM web scraping can affect site stability and raise legal, privacy, or ethical concerns.
  • The method is designed to be reliable and scalable.
  • It allows automatic inference of LLM-related scrapers.
  • The paper is hosted on arXiv.
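
As a rough illustration of how identified scrapers could feed into the Robots Exclusion Protocol, the sketch below turns a list of flagged crawler names into a robots.txt file. The function name `build_robots_txt` and the bot names are hypothetical, and nothing here is taken from the paper. Compliance with robots.txt is voluntary on the crawler's side.

```python
def build_robots_txt(flagged_bots, disallow_path="/"):
    """Return robots.txt text that disallows each flagged crawler.

    flagged_bots: iterable of crawler product names (hypothetical examples
    below), i.e. the short name a crawler matches against User-agent lines.
    """
    lines = []
    for bot in sorted(set(flagged_bots)):
        lines.append(f"User-agent: {bot}")
        lines.append(f"Disallow: {disallow_path}")
        lines.append("")  # blank line separates groups
    # All other crawlers remain allowed.
    lines.append("User-agent: *")
    lines.append("Allow: /")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(["ExampleBot", "SampleCrawler", "ExampleBot"])
```

Duplicate names are collapsed and the groups are sorted so that the generated file stays stable between runs.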

Entities

Institutions

  • arXiv
