Probe-Geometry Alignment Erases LLM Memorization Signatures Below Chance
A new study on arXiv (2605.01699) demonstrates that adversarial probes can detect memorization traces in large language models even after behavioral unlearning, and that these traces can be surgically removed without capability loss. The protocol uses a leave-one-out cross-sequence probe to test whether memorization signatures generalize to held-out sequences. Signatures were consistent across scales, with memorization-specific gaps of +0.32 on Pythia-70M, +0.19 on GPT-2 Medium, and +0.30 on Mistral-7B. On Pythia-70M, a random-initialization control collapsed to -0.04 at the deepest layer, where the pretrained signature peaked. The probe direction is causally separable from recall: projecting it out collapses the signature locally (from +0.44 to -0.19) while behavioral recall barely changes. A probe trained on naturally memorized content fails to classify secrets injected via fine-tuning, indicating two distinct representations.
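For intuition, here is a minimal sketch of how a leave-one-out cross-sequence probe and its memorization gap could be computed. The toy features, the logistic-regression probe, and the gap definition (mean held-out probe score on memorized minus non-memorized sequences) are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch of a leave-one-out cross-sequence memorization probe.
# `feats[i]` stands in for per-token hidden states (tokens x dim) of sequence i;
# `labels[i]` marks the sequence as memorized (1) or not (0).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D = 64
labels = [0, 1] * 8                              # 8 clean, 8 memorized sequences
feats = [rng.normal(size=(32, D)) + 0.5 * lbl    # toy stand-in for activations
         for lbl in labels]

def loo_cross_sequence_gap(feats, labels):
    """Hold each sequence out in turn, train a linear probe on the rest,
    and score the held-out sequence. The memorization gap is the mean
    held-out score on memorized sequences minus non-memorized ones."""
    scores = []
    for i in range(len(feats)):
        X_tr = np.vstack([f for j, f in enumerate(feats) if j != i])
        y_tr = np.concatenate([np.full(len(f), labels[j])
                               for j, f in enumerate(feats) if j != i])
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        # Score the held-out sequence: mean predicted P(memorized) per token.
        scores.append(probe.predict_proba(feats[i])[:, 1].mean())
    scores, labels = np.array(scores), np.array(labels)
    return scores[labels == 1].mean() - scores[labels == 0].mean()

print(f"memorization gap: {loo_cross_sequence_gap(feats, labels):+.2f}")
```

Under this reading, the random-initialization control would be the same computation run on activations from an untrained copy of the model; a gap near zero there (the reported -0.04) suggests the signature comes from pretraining rather than probe overfitting.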
Key facts
- arXiv paper 2605.01699 on probe-geometry alignment
- Adversarial probes detect memorization traces after unlearning
- Leave-one-out cross-sequence probe used
- Memorization gaps: +0.32 (Pythia-70M), +0.19 (GPT-2 Medium), +0.30 (Mistral-7B)
- Random-initialization control collapses to -0.04 on Pythia-70M
- Probe direction separable from recall: signature drops from +0.44 to -0.19 when projected out (see the projection sketch after this list)
- Behavioral recall barely changes after projection
- Probe on naturally memorized content does not classify injected secrets
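The causal-separability test in the items above can be pictured as a rank-one edit of the hidden states. The sketch below uses assumed mechanics, with `w` standing in for the trained probe's weight vector: it removes the component of each hidden state along the probe direction, the style of intervention behind the reported +0.44 to -0.19 collapse.

```python
# Hypothetical sketch of projecting the probe direction out of hidden states.
import numpy as np

def project_out(H: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state (rows of H) along w:
    h' = h - (h . w / ||w||^2) w. Directions orthogonal to w are untouched."""
    w_unit = w / np.linalg.norm(w)
    return H - np.outer(H @ w_unit, w_unit)

rng = np.random.default_rng(0)
H, w = rng.normal(size=(32, 64)), rng.normal(size=64)
H_edit = project_out(H, w)
assert np.allclose(H_edit @ w, 0.0)   # no residual signal along the probe axis
```

If behavioral recall survives this edit while the probe score drops below chance, the memorization signature and the recall mechanism occupy different subspaces, which matches the separability finding summarized above.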