Unlearning Depth Score Measures LLM Knowledge Erasure
A new metric, the Unlearning Depth Score (UDS), quantifies how deeply knowledge has been erased from a large language model (LLM) after unlearning. Existing output-level metrics fail to detect residual knowledge that remains recoverable from internal representations, and white-box methods often require auxiliary training. UDS uses activation patching against a retain-model baseline to identify the layers that encode the target knowledge, then scores erasure on a scale from 0 to 1. In a meta-evaluation of 20 metrics across 150 unlearned models produced by 8 unlearning methods, UDS achieved the highest faithfulness and robustness. The research is published on arXiv under the identifier 2605.24614.
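To make the idea of activation patching against a retain baseline concrete, here is a minimal toy sketch in plain Python. It is not the UDS formula from the paper, which this summary does not give. The two-dimensional "model", the threshold `tau`, the weighting rule, the reading of 1 as fully erased and 0 as fully intact, and every function name (`forward`, `patching_drops`, `depth_score`, `make_model`) are assumptions made for illustration only.

```python
import math

def layer_step(W, h):
    """One toy residual layer: h + tanh(W @ h), with W given as a list of rows."""
    return [h_i + math.tanh(sum(w * v for w, v in zip(row, h)))
            for h_i, row in zip(h, W)]

def forward(model, x, patch=None):
    """Run the toy model and return (target logit, hidden state after each layer).

    patch = (layer_index, state) overwrites the hidden state produced by that
    layer. This overwrite is the core move in activation patching.
    """
    h, states = list(x), []
    for i, W in enumerate(model["layers"]):
        h = layer_step(W, h)
        if patch is not None and patch[0] == i:
            h = list(patch[1])
        states.append(h)
    logit = sum(u * v for u, v in zip(model["readout"], h))
    return logit, states

def patching_drops(full, source, x):
    """Per-layer drop in the full model's logit when source activations are patched in."""
    base, _ = forward(full, x)
    _, source_states = forward(source, x)
    return [base - forward(full, x, patch=(i, s))[0]
            for i, s in enumerate(source_states)]

def depth_score(full, retain, unlearned, x, tau=0.05):
    """Hypothetical 0-1 erasure score (an assumption, not the paper's definition).

    Layers where patching in the retain model's activation lowers the logit by
    more than tau are treated as encoding the target knowledge. On those layers
    the unlearned model's drop is compared with the retain model's drop.
    """
    d_retain = patching_drops(full, retain, x)
    d_unlearned = patching_drops(full, unlearned, x)
    selected = [i for i, d in enumerate(d_retain) if d > tau]
    if not selected:
        raise ValueError("no layer encodes the target knowledge at this threshold")
    total = sum(d_retain[i] for i in selected)
    score = 0.0
    for i in selected:
        ratio = min(1.0, max(0.0, d_unlearned[i] / d_retain[i]))
        score += ratio * d_retain[i] / total  # weight by the retain-model drop
    return score

def make_model(strength):
    """Three-layer toy model; only the middle layer writes the 'knowledge' feature."""
    zero = [[0.0, 0.0], [0.0, 0.0]]
    write = [[0.0, 0.0], [strength, 0.0]]
    return {"layers": [zero, write, zero], "readout": [0.0, 1.0]}

full = make_model(2.0)     # model that still holds the target knowledge
retain = make_model(0.0)   # baseline that never learned it
partial = make_model(0.5)  # unlearned model with residual knowledge
x = [1.0, 0.0]

print(depth_score(full, retain, partial, x))
```

In this toy setup, an unlearned model identical to the retain baseline scores 1, one identical to the full model scores 0, and the partially unlearned model lands in between. A real implementation would patch transformer activations, for example through forward hooks, rather than hand-built lists.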
Key facts
- UDS stands for Unlearning Depth Score
- UDS uses activation patching to measure knowledge erasure depth
- It identifies layers encoding target knowledge using a retain model baseline
- The metric produces a 0-1 scale score
- Meta-evaluation covered 20 metrics on 150 unlearned models
- The models spanned 8 different unlearning methods
- UDS achieved the highest faithfulness and robustness in the meta-evaluation
- Published on arXiv with ID 2605.24614
Entities
Institutions
- arXiv