Unlearning Depth Score Measures LLM Knowledge Erasure

ai-technology · 2026-05-26

A new metric called the Unlearning Depth Score (UDS) quantifies how deeply knowledge has been erased from a large language model (LLM) after unlearning. Existing output-level metrics fail to detect residual knowledge that can still be recovered from internal representations, and white-box methods often require auxiliary training. UDS uses activation patching, with a retain model as the baseline, to identify the layers that encode the target knowledge, and then scores erasure on a scale from 0 to 1. In a meta-evaluation of 20 metrics across 150 unlearned models produced by 8 unlearning methods, UDS achieved the highest faithfulness and robustness. The research is published on arXiv under identifier 2605.24614.

Key facts

UDS stands for Unlearning Depth Score
UDS uses activation patching to measure knowledge erasure depth
It identifies layers encoding target knowledge using a retain model baseline
The metric produces a 0-1 scale score
The meta-evaluation covered 20 metrics on 150 unlearned models
The models spanned 8 different unlearning methods
UDS achieved the highest faithfulness and robustness in the evaluation
Published on arXiv with ID 2605.24614
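
The activation-patching idea summarized above can be illustrated with a toy sketch. This is not the paper's implementation or its exact formula: the additive "layers", the `readout` stand-in for a target-answer score, the threshold `tau`, the averaging rule, and the convention that 1 means fully erased are all assumptions made for illustration. The sketch patches hidden states from a source model into a full (knowledge-bearing) model, treats layers where the retain model's states lower the target score as encoding the knowledge, and compares the unlearned model's effect at those layers against the retain baseline.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(shift: float) -> Layer:
    """Hypothetical toy layer: adds a constant shift to every hidden unit."""
    return lambda h: [v + shift for v in h]


def hidden_states(layers: List[Layer], x: Vector) -> List[Vector]:
    """Run the input through every layer and keep each layer's output."""
    states, h = [], list(x)
    for layer in layers:
        h = layer(h)
        states.append(h)
    return states


def readout(h: Vector) -> float:
    """Stand-in for the model's score on the target answer (assumption)."""
    return sum(h)


def patched_score(full: List[Layer], source: List[Layer], x: Vector, idx: int) -> float:
    """Take the source model's hidden state after layer idx, then finish
    the forward pass with the remaining layers of the full model."""
    h = hidden_states(source, x)[idx]
    for layer in full[idx + 1:]:
        h = layer(h)
    return readout(h)


def depth_score(full, retain, unlearned, x, tau: float = 0.05) -> float:
    """Illustrative 0-1 score (assumed convention: 1 = erased as deeply as retain)."""
    base = readout(hidden_states(full, x)[-1])
    ratios = []
    for idx in range(len(full)):
        gap_retain = base - patched_score(full, retain, x, idx)
        if gap_retain <= tau:
            continue  # this layer does not encode the target knowledge
        gap_unlearned = base - patched_score(full, unlearned, x, idx)
        ratios.append(min(1.0, max(0.0, gap_unlearned / gap_retain)))
    if not ratios:
        raise ValueError("no layer exceeds the threshold; nothing to score")
    return sum(ratios) / len(ratios)


x = [1.0, 2.0]
full = [make_layer(s) for s in (0.0, 0.5, 0.5, 0.0)]      # knowledge added in layers 1-2
retain = [make_layer(s) for s in (0.0, 0.0, 0.0, 0.0)]    # never learned the target
shallow = [make_layer(s) for s in (0.0, 0.5, 0.5, -1.0)]  # output suppressed only at the end

same_output = readout(hidden_states(shallow, x)[-1]) == readout(hidden_states(retain, x)[-1])
shallow_score = depth_score(full, retain, shallow, x)
print(same_output, round(shallow_score, 3))  # True 0.333
```

In this toy setup the `shallow` model produces the same final output as the retain model, so an output-level check would call it fully unlearned, yet its intermediate layers still carry the knowledge and the sketch scores it well below 1.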
