Canary Tokens for Identifying AI Web Scrapers
A proposed technique uses canary tokens to automatically detect the web scrapers used by large language models (LLMs). Described in an arXiv paper (2605.13706), it addresses the shortcomings of current identification methods, which depend on voluntary disclosure, one-off experiments, or crowd-sourced reports. Website operators serve decoy content and identify the scrapers that interact with the embedded tokens, which in turn lets them apply access control mechanisms such as the Robots Exclusion Protocol more effectively. The goal is to help website owners curb LLM-related scraping, which can threaten site stability and raise legal, privacy, or ethical concerns.
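The decoy-and-detect idea can be illustrated with a minimal sketch. It is not the paper's implementation: the `CanaryRegistry` class, the `/canary/` path, the hidden-link markup, and the example user agent are all hypothetical choices made for illustration. The sketch issues a unique token to each client, embeds it in decoy content, and later looks up which client received any token that resurfaces.

```python
import secrets


class CanaryRegistry:
    """Hypothetical registry mapping each issued decoy token to its recipient."""

    def __init__(self):
        self._issued = {}  # token -> (user_agent, ip)

    def issue(self, user_agent, ip):
        """Create a unique, unguessable token and remember who received it."""
        token = secrets.token_hex(8)  # 16 hex characters
        self._issued[token] = (user_agent, ip)
        return token

    def decoy_html(self, token):
        """Wrap the token in a link that ordinary visitors are unlikely to follow."""
        return f'<a href="/canary/{token}" style="display:none">archive</a>'

    def match(self, text):
        """Return every client whose token appears in the given text."""
        return [client for token, client in self._issued.items() if token in text]


registry = CanaryRegistry()
token = registry.issue("ExampleLLMBot/1.0", "203.0.113.7")  # assumed client details
page_fragment = registry.decoy_html(token)

# Later, the token shows up in a request line (or any other observed text).
hits = registry.match(f"GET /canary/{token} HTTP/1.1")
print(hits)
```

Because each token is issued to exactly one client, any later sighting of that token points back to the scraper that collected it, without relying on the scraper disclosing itself.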
Key facts
- arXiv paper 2605.13706 proposes canary tokens for identifying LLM web scrapers.
- Existing identification methods rely on voluntary disclosure, experiments, or crowd-sourced reports.
- Canary tokens are pieces of decoy content that trigger alerts when accessed.
- The technique aims to improve scraper access control mechanisms like the Robots Exclusion Protocol.
- LLM web scraping can affect site stability and raise legal, privacy, or ethical concerns.
- The method is designed to be reliable and scalable.
- It allows automatic inference of LLM-related scrapers.
- The paper is hosted on arXiv.
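Once scrapers have been identified, the Robots Exclusion Protocol step might look like the sketch below. It is an illustration under stated assumptions, not the paper's tooling: `build_robots_txt` is a hypothetical helper and `ExampleLLMBot` is a made-up user agent. The generated rules are checked with Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(flagged_agents):
    """Hypothetical helper: disallow every flagged user agent, allow everyone else."""
    lines = []
    for agent in sorted(set(flagged_agents)):
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


# "ExampleLLMBot" stands in for a user agent flagged by canary tokens.
robots_txt = build_robots_txt(["ExampleLLMBot"])

# Verify the rules behave as intended before publishing them.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("ExampleLLMBot", "/page"))  # False
print(parser.can_fetch("OtherBot", "/page"))       # True
```

The Robots Exclusion Protocol is advisory, so rules like these only restrain scrapers that choose to honor them.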
Entities
Institutions
- arXiv