Interaction-Layer Watermarks to Detect LLM Knowledge Distillation

ai-technology · 2026-05-20

A recent arXiv preprint (2605.16462v1) introduces interaction-layer antidistillation watermarks for detecting unauthorized knowledge distillation from deployed LLM APIs. Existing defenses, such as green-list watermarks and cryptographic techniques, are vulnerable to paraphrasing attacks, which strip the signal while preserving the knowledge. The new approach moves the trace into the teacher's interaction patterns: a system prompt sporadically elicits behavioral markers, such as explicit follow-up questions, low-frequency variants, or declarative restatements. A model trained by an oblivious distiller picks up these behaviors, so the defender can audit a suspect model through black-box queries, with a human-validated LLM serving as the judge. The design addresses a core constraint: the defender has no control over the attacker's training pipeline or next-token logits.
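The teacher-side step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the marker wording, the `marker_rate` parameter, and the function name `build_system_prompt` are hypothetical and do not come from the preprint, which may select and phrase its markers differently.

```python
import random

# Hypothetical marker instructions; the preprint's actual prompt wording is not known here.
MARKER_INSTRUCTIONS = {
    "follow_up": "End your answer with one explicit follow-up question.",
    "restatement": "Begin by restating the user's request as a declarative sentence.",
}


def build_system_prompt(base_prompt: str, rng: random.Random, marker_rate: float = 0.1):
    """Return (system_prompt, marker_name), where marker_name is None if no marker was added.

    With probability `marker_rate` (an assumed knob), one marker instruction is
    appended to the base system prompt, so only a fraction of API responses
    carry the behavioral trace.
    """
    if not 0.0 <= marker_rate <= 1.0:
        raise ValueError("marker_rate must be in [0, 1]")
    if rng.random() < marker_rate:
        # Sort the keys so the choice is reproducible for a given seed.
        name = rng.choice(sorted(MARKER_INSTRUCTIONS))
        return f"{base_prompt}\n\n{MARKER_INSTRUCTIONS[name]}", name
    return base_prompt, None


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(5):
        prompt, marker = build_system_prompt("You are a helpful assistant.", rng, 0.5)
        print(marker, "|", prompt.replace("\n", " "))
```

Keeping the marker sporadic rather than constant is what makes it a low-rate behavioral signal that a student model could absorb along with the rest of the teacher's outputs.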

Key facts

arXiv:2605.16462v1 proposes interaction-layer antidistillation watermarks.
Existing defenses like green-list watermarks are vulnerable to paraphrasing attacks.
The method induces behavioral markers via system prompts.
Markers include follow-up questions, low-frequency variants, or declarative restatements.
The defender audits a suspect model via black-box queries, with a human-validated LLM as judge.
The defender cannot control the attacker's training pipeline or logits.
The approach targets unauthorized knowledge distillation from deployed LLM APIs.
The preprint is listed on arXiv as a cross-listing ('cross').
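The black-box audit from the key facts can be illustrated with a small sketch. Everything here is an assumption for illustration: `stub_judge` is a trivial stand-in for the human-validated LLM judge, and the one-sided binomial test against an assumed `baseline_rate` is one plausible way to decide whether markers appear more often than expected, not necessarily the statistical procedure the preprint uses.

```python
from math import comb


def binomial_tail(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def stub_judge(response: str) -> bool:
    """Hypothetical stand-in for the LLM judge: flags responses that end with a question."""
    return response.rstrip().endswith("?")


def audit(responses, baseline_rate: float, alpha: float = 0.01):
    """Return (hits, p_value, flagged) for a list of suspect-model responses.

    `baseline_rate` is the assumed marker rate of models that never saw the
    teacher's outputs; `alpha` is an assumed significance threshold.
    """
    hits = sum(stub_judge(r) for r in responses)
    p_value = binomial_tail(hits, len(responses), baseline_rate)
    return hits, p_value, p_value < alpha


if __name__ == "__main__":
    suspect = ["Here is the answer. Want an example?"] * 8 + ["Here is the answer."] * 2
    print(audit(suspect, baseline_rate=0.05))
```

In practice the baseline rate would have to be estimated from models known not to have been trained on the teacher's outputs, since some markers (such as follow-up questions) occur naturally.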
