ARTFEED — Contemporary Art Intelligence

Language Models Detect and Identify Activation Perturbations with High Accuracy

ai-technology · 2026-04-22

Research demonstrates that large language models can identify when their internal activations have been altered. Experiments applied one of two perturbations, dropout-like masking or Gaussian noise, to the activations corresponding to specific sentences during processing. Models from the Llama, Olmo, and Qwen families, ranging from 8B to 32B parameters, were tested. These systems could not only detect the presence of a perturbation but also pinpoint its location within a sequence: in a multiple-choice format, models achieved high, sometimes perfect, accuracy in identifying which sentence had been modified. The models also learned to distinguish between the two types of interference, dropout versus noise, when given in-context examples. Notably, Qwen's zero-shot accuracy at identifying the perturbation type improved as the strength of the interference increased, but this capability diminished when the in-context labels were swapped or incorrect. The study, detailed in the arXiv preprint 2604.17465v1, provides evidence that language models possess a form of introspective awareness of their own computational states.
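The preprint's exact implementation is not reproduced here, but the two perturbation types are standard operations. A minimal sketch, assuming activations are stored as a tokens-by-dimensions matrix (the function names and parameter values below are illustrative, not from the paper):

```python
import numpy as np

def dropout_mask(h, p, rng):
    """Dropout-like masking: zero out a random fraction p of activation entries."""
    mask = rng.random(h.shape) >= p
    return h * mask

def gaussian_noise(h, sigma, rng):
    """Gaussian noise: add zero-mean noise with standard deviation sigma."""
    return h + rng.normal(0.0, sigma, size=h.shape)

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 16))  # hypothetical activations: 5 tokens x 16 dims
h_drop = dropout_mask(h, p=0.5, rng=rng)     # sentence perturbed by masking
h_noise = gaussian_noise(h, sigma=1.0, rng=rng)  # sentence perturbed by noise
```

In the study's setup, such a perturbation would be applied only to the activations of one chosen sentence, and the model would then be asked which sentence was modified and, with in-context examples, which kind of perturbation it was.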

Key facts

  • Language models can detect perturbations applied to their internal activations.
  • Two perturbation types were tested: dropout-like masking and Gaussian noise.
  • Tested models included families like Llama, Olmo, and Qwen.
  • Model sizes ranged from 8 billion to 32 billion parameters.
  • Models could localize which specific sentence was perturbed.
  • Models achieved high, often perfect, accuracy in detection tasks.
  • Models learned to distinguish between dropout and noise via in-context teaching.
  • Qwen's zero-shot identification accuracy varied with perturbation strength and label correctness.
