ARTFEED — Contemporary Art Intelligence

Unlearning Depth Score Measures LLM Knowledge Erasure

ai-technology · 2026-05-26

A new metric, the Unlearning Depth Score (UDS), quantifies how deeply knowledge has been erased from a large language model (LLM) after unlearning. Existing output-level metrics fail to detect residual knowledge that remains recoverable from internal representations, and white-box methods often require auxiliary training. UDS uses activation patching to identify the layers that encode the target knowledge, using a retain model as the baseline, and then scores erasure on a scale from 0 to 1. In a meta-evaluation of 20 metrics across 150 unlearned models produced by 8 unlearning methods, UDS achieved the highest faithfulness and robustness. The research is published on arXiv under identifier 2605.24614.
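To make the activation-patching idea concrete, here is a minimal toy sketch in Python. It is not the paper's implementation, and the exact UDS formula is not given in this summary. The sketch assumes that the retain model is a model that never learned the target knowledge, that a layer "encodes" the knowledge when patching the original model's activation into the retain model raises the output, and that erasure is the share of that rise the unlearned model no longer reproduces. The names `activations`, `patched_output`, `toy_depth_score`, and the threshold `tau` are hypothetical, and each "model" is reduced to a list of scalar (weight, bias) layers.

```python
# Toy illustration only: each "model" is a list of (weight, bias) layers
# acting on a scalar hidden state. Real LLM activations are tensors.

def activations(layers, x):
    """Return the hidden state after every layer."""
    h, acts = x, []
    for w, b in layers:
        h = w * h + b
        acts.append(h)
    return acts


def patched_output(host, donor, x, layer):
    """Output of `host` when its state after `layer` is replaced by the donor's.

    The host's earlier layers are skipped because their result would be
    overwritten by the patch anyway.
    """
    h = activations(donor, x)[layer]
    for w, b in host[layer + 1:]:
        h = w * h + b
    return h


def toy_depth_score(full, unlearned, retain, x, tau=1e-6):
    """Hypothetical 0-1 erasure score (an assumption, not the published UDS)."""
    base = activations(retain, x)[-1]  # retain model output without patching
    weighted, total = 0.0, 0.0
    for layer in range(len(retain)):
        # How much target signal the original model carries at this layer.
        d_full = patched_output(retain, full, x, layer) - base
        if d_full <= tau:
            continue  # this layer does not carry the target knowledge
        # How much of that signal the unlearned model still carries.
        d_unl = patched_output(retain, unlearned, x, layer) - base
        residual = min(max(d_unl / d_full, 0.0), 1.0)
        weighted += d_full * (1.0 - residual)
        total += d_full
    if total == 0.0:
        raise ValueError("no layer carries the target knowledge")
    return weighted / total


RETAIN = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]   # never saw the fact
FULL = [(1.0, 0.0), (1.0, 2.0), (1.0, 1.0), (1.0, 0.0)]     # learned the fact
PARTIAL = [(1.0, 0.0), (1.0, 1.0), (1.0, 0.5), (1.0, 0.0)]  # half-erased

print(toy_depth_score(FULL, PARTIAL, RETAIN, x=1.0))  # 0.5
```

In this toy setup a model identical to the original scores 0 (nothing erased) and a model identical to the retain model scores 1 (fully erased). A real measurement would patch hidden-state tensors inside a transformer and compare a quantity such as the log-probability of the target answer.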

Key facts

  • UDS stands for Unlearning Depth Score
  • UDS uses activation patching to measure knowledge erasure depth
  • It identifies layers encoding target knowledge using a retain model baseline
  • The metric produces a score on a 0-1 scale
  • Meta-evaluation covered 20 metrics on 150 unlearned models
  • The models spanned 8 different unlearning methods
  • UDS achieved the highest faithfulness and robustness in the meta-evaluation
  • Published on arXiv with ID 2605.24614

Entities

Institutions

  • arXiv
