Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries

other · 2026-05-20

A recent arXiv preprint (2605.18891) audits the claim that reasoning models retain memorized information in their reasoning traces after unlearning. The authors work with DeepSeek-R1-Distill-Qwen-7B, using LoRA to memorize fictional authors and then applying NPO unlearning conditioned on a six-token canary head. On one seed, replacing the thinking trace with a short non-canary prefill on the same weights lowered the answer rate by as much as the bypass gap itself, whether or not the prefill resembled the training template. On a second seed, the bypass gap shrank and the prefill swap reversed direction, raising the answer rate to its maximum. The findings suggest that a positive parser-split bypass gap alone neither confirms nor rules out hidden weight-level memorization. The summary does not name the authors' institutions.

Key facts

arXiv:2605.18891v1 was announced as a cross-listing (announce type: cross).
The study audits bypass patterns in reasoning models after unlearning.
DeepSeek-R1-Distill-Qwen-7B was used with LoRA-memorized fictional authors.
NPO unlearning was applied, conditioned on a six-token canary head.
On one seed, swapping the thinking trace for a short non-canary prefill dropped answer rate as much as the bypass gap.
On a second seed, the bypass gap shrank and the prefill swap reversed direction.
A positive parser-split bypass gap does not definitively indicate weight-level memorization.
The same metric flipped sign on a different distillate.
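
The comparison in the facts above can be made concrete with a small sketch. It is not the paper's code: the names `answer_rate`, `SeedAudit`, `prefill_delta`, and `verdict`, the 0.05 tolerance, and the toy hit lists are all hypothetical, and the bypass gap is taken as a number measured elsewhere because this summary does not give its exact definition. The sketch only illustrates the control logic: if swapping the trace for a non-canary prefill moves the answer rate about as much as the bypass gap, the gap by itself cannot be attributed to the weights.

```python
from dataclasses import dataclass
from typing import Sequence


def answer_rate(hits: Sequence[bool]) -> float:
    """Fraction of graded prompts on which the memorized answer appeared."""
    if not hits:
        raise ValueError("need at least one graded response")
    return sum(hits) / len(hits)


@dataclass
class SeedAudit:
    """Per-seed measurements (hypothetical container, not from the paper)."""
    bypass_gap: float                # measured upstream; definition not reproduced here
    own_trace_hits: Sequence[bool]   # graded answers with the model's own thinking trace
    prefill_hits: Sequence[bool]     # graded answers with a short non-canary prefill

    def prefill_delta(self) -> float:
        # Positive: the prefill swap lowered the answer rate.
        # Negative: the prefill swap raised it (the "reversed" case).
        return answer_rate(self.own_trace_hits) - answer_rate(self.prefill_hits)

    def verdict(self, tol: float = 0.05) -> str:
        delta = self.prefill_delta()
        if delta < 0:
            return "reversed"
        if self.bypass_gap > 0 and abs(delta - self.bypass_gap) <= tol:
            return "prefill-explained"
        return "inconclusive"


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the preprint.
    seed_a = SeedAudit(0.5, [True] * 8 + [False] * 2, [True] * 3 + [False] * 7)
    seed_b = SeedAudit(0.1, [True] * 6 + [False] * 4, [True] * 10)
    print(seed_a.verdict(), seed_b.verdict())
```

Under these assumptions, a "prefill-explained" verdict means the same weights lose as much answer rate from a trace swap as the bypass gap shows, and "reversed" corresponds to the second-seed pattern in which the swap raises the answer rate instead.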
