Distinguishable Deletion: New Paradigm for LLM Unlearning

ai-technology · 2026-05-20

A new approach called Distinguishable Deletion (D²) has been proposed to improve unlearning in large language models (LLMs). Existing techniques fall into two categories: Knowledge Deletion (KD), which aims to eliminate unwanted information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive information during inference. KD tends to suppress specific token sequences rather than fully erase the underlying knowledge, which makes its deletion biased. DR leaves the original knowledge intact, so harmful information can resurface. D² restricts the response distribution in the latent representation rather than targeting specific tokens, removing undesirable knowledge while keeping it distinguishable from retained knowledge; this in turn enables a refusal mechanism for safely handling unlearned inputs. The approach aims to combine the strengths of deletion and refusal for more effective LLM unlearning.

Key facts

Distinguishable Deletion (D²) is a new paradigm for LLM unlearning.
Existing approaches are Knowledge Deletion (KD) and Distinguishable Refusal (DR).
KD aims to erase undesirable information during training.
DR steers models away from using sensitive knowledge during inference.
KD tends toward biased deletion because it suppresses specific token sequences rather than the underlying knowledge.
DR risks re-emergence of harmful knowledge because the underlying knowledge remains intact.
D² restricts the response distribution in the latent representation rather than targeting specific tokens.
D² enables a refusal mechanism for safe handling of unlearned inputs.
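
The last two points can be illustrated with a toy sketch. This is not the method from the source, which does not give its training objective or its refusal rule. It only assumes that unlearning has already pushed the latent representations of forgotten inputs into a compact region that is separable from retained inputs, and shows how a distance-based gate could then trigger a refusal. All names here (fit_forget_region, should_refuse, margin) and the 2-D vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_forget_region(forget_latents):
    """Return (center, radius) of a ball enclosing the forget-set latents.

    Assumption: after unlearning, forgotten inputs map to a compact
    latent region. The source does not specify this geometry.
    """
    center = centroid(forget_latents)
    radius = max(distance(v, center) for v in forget_latents)
    return center, radius

def should_refuse(latent, center, radius, margin=0.5):
    """Refuse when a latent falls inside the forget region plus a margin.

    The margin is a hypothetical tuning knob, not a parameter from the source.
    """
    return distance(latent, center) <= radius + margin

# Toy 2-D latents standing in for real hidden states (made-up values).
forget_latents = [[0.0, 0.1], [0.1, -0.1], [-0.1, 0.0]]
retain_latents = [[3.0, 3.1], [2.9, 3.0], [3.2, 2.8]]

center, radius = fit_forget_region(forget_latents)

# A query near the forget region is refused; retained inputs pass through.
refuse_near_forget = should_refuse([0.05, 0.0], center, radius)
refuse_retained = [should_refuse(v, center, radius) for v in retain_latents]
```

In this sketch the gate works only because the two groups of latents are far apart, which is the property "distinguishable" refers to: if forgotten and retained representations overlapped, no threshold could separate them.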

Sources

arXiv cs.AI — 2026-05-19