Probe-Geometry Alignment Erases LLM Memorization Signatures Below Chance
A new study on arXiv (2605.01699) demonstrates that adversarial probes can detect memorization traces in large language models even after behavioral unlearning, and that these traces can be surgically removed without capability loss. The protocol uses a leave-one-out cross-sequence probe to test whether memorization signatures generalize to held-out sequences. Signatures were consistent across scales, with memorization-specific gaps of +0.32 on Pythia-70M, +0.19 on GPT-2 Medium, and +0.30 on Mistral-7B. On Pythia-70M, a random-initialization control collapsed to -0.04 at the deepest layer, where the pretrained signature peaked. The probe direction is causally separable from recall: projecting it out collapses the signature locally (from +0.44 to -0.19) while behavioral recall barely changes. A probe trained on naturally memorized content fails to classify secrets injected via fine-tuning, indicating two distinct representations.
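For intuition, here is a minimal sketch of how a leave-one-out cross-sequence probe and its memorization gap could be computed. The toy features, the logistic-regression probe, and the gap definition (mean held-out probe score on memorized minus non-memorized sequences) are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch of a leave-one-out cross-sequence memorization probe.
# `feats[i]` stands in for per-token hidden states (tokens x dim) of sequence i;
# `labels[i]` marks the sequence as memorized (1) or not (0).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D = 64
labels = [0, 1] * 8                              # 8 clean, 8 memorized sequences
feats = [rng.normal(size=(32, D)) + 0.5 * lbl    # toy stand-in for activations
         for lbl in labels]

def loo_cross_sequence_gap(feats, labels):
    """Hold each sequence out in turn, train a linear probe on the rest,
    and score the held-out sequence. The memorization gap is the mean
    held-out score on memorized sequences minus non-memorized ones."""
    scores = []
    for i in range(len(feats)):
        X_tr = np.vstack([f for j, f in enumerate(feats) if j != i])
        y_tr = np.concatenate([np.full(len(f), labels[j])
                               for j, f in enumerate(feats) if j != i])
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        # Score the held-out sequence: mean predicted P(memorized) per token.
        scores.append(probe.predict_proba(feats[i])[:, 1].mean())
    scores, labels = np.array(scores), np.array(labels)
    return scores[labels == 1].mean() - scores[labels == 0].mean()

print(f"memorization gap: {loo_cross_sequence_gap(feats, labels):+.2f}")
```

Under this reading, the random-initialization control would be the same computation run on activations from an untrained copy of the model; a gap near zero there (the reported -0.04) suggests the signature comes from pretraining rather than probe overfitting.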
Key facts
- arXiv paper 2605.01699 on probe-geometry alignment
- Adversarial probes detect memorization traces after unlearning
- Leave-one-out cross-sequence probe used
- Memorization gaps: +0.32 (Pythia-70M), +0.19 (GPT-2 Medium), +0.30 (Mistral-7B)
- Random-initialization control collapses to -0.04 on Pythia-70M
- Probe direction separable from recall: signature drops from +0.44 to -0.19 when projected out (see the projection sketch after this list)
- Behavioral recall barely changes after projection
- Probe on naturally memorized content does not classify injected secrets
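The causal-separability test in the items above can be pictured as a rank-one edit of the hidden states. The sketch below uses assumed mechanics, with `w` standing in for the trained probe's weight vector: it removes the component of each hidden state along the probe direction, the style of intervention behind the reported +0.44 to -0.19 collapse.

```python
# Hypothetical sketch of projecting the probe direction out of hidden states.
import numpy as np

def project_out(H: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state (rows of H) along w:
    h' = h - (h . w / ||w||^2) w. Directions orthogonal to w are untouched."""
    w_unit = w / np.linalg.norm(w)
    return H - np.outer(H @ w_unit, w_unit)

rng = np.random.default_rng(0)
H, w = rng.normal(size=(32, 64)), rng.normal(size=64)
H_edit = project_out(H, w)
assert np.allclose(H_edit @ w, 0.0)   # no residual signal along the probe axis
```

If behavioral recall survives this edit while the probe score drops below chance, the memorization signature and the recall mechanism occupy different subspaces, which matches the separability finding summarized above.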