Machine Unlearning Reliability Gap in Language Models
A new arXiv paper examines a reliability paradox in machine unlearning for generative language models: a model's calibration error can diverge from the reliability of the decision rules it actually uses. Fine-tuned models show a much lower expected calibration error (ECE ~ 0.04) than their pretrained counterparts (ECE > 0.5). Yet the researchers found that low calibration error does not guarantee accurate or reliable outcomes, because a well-calibrated model may still base its decisions on spurious correlations. The study assesses this effect with several evaluation methods, including the TOFU benchmark.
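To make the calibration numbers above concrete, the following is a minimal sketch of how ECE is commonly computed, together with two related measures, maximum calibration error (MCE) and the Brier score. It is not the paper's code: the function name `calibration_metrics`, the `n_bins` parameter, and the choice of 10 equal-width bins are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def calibration_metrics(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins; the paper's binning is not stated
) -> Tuple[float, float, float]:
    """Return (ECE, MCE, Brier) from top-label confidences and correctness flags."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    # Group prediction indices into equal-width confidence bins over [0, 1].
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append(i)

    n = len(confidences)
    ece = 0.0
    mce = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        gap = abs(avg_conf - accuracy)
        ece += (len(members) / n) * gap  # gap weighted by the bin's share of samples
        mce = max(mce, gap)              # worst gap over all non-empty bins

    # Brier score on the top label: mean squared gap between confidence and outcome.
    brier = sum(
        (c - (1.0 if ok else 0.0)) ** 2 for c, ok in zip(confidences, correct)
    ) / n
    return ece, mce, brier


# A model that says 0.75 and is right three times out of four is perfectly
# calibrated on this toy sample, yet the metrics say nothing about why it was right.
print(calibration_metrics([0.75] * 4, [True, True, True, False]))
```

All three quantities depend only on confidences and outcomes, which is why a low value cannot, by itself, reveal whether the model relied on a spurious cue.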
Key facts
- arXiv paper 2605.20915 examines machine unlearning in language models.
- Calibration error is used as a proxy for reliability but can be misleading.
- Fine-tuned models have ECE ~ 0.04; pretrained models have ECE > 0.5.
- Low calibration error does not imply reliable decision rules.
- Models may rely on spurious correlations while remaining well calibrated.
- The study uses the TOFU benchmark and a multiple-choice question-answering protocol.
- Probabilistic reliability is measured with ECE, maximum calibration error (MCE), and Brier score.
- Decision-rule reliability measured via Integrated Gradients and Local Mutual Information.
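As a rough illustration of how an attribution method such as Integrated Gradients can expose which inputs drive a decision, here is a minimal sketch on a toy logistic model. Everything in it is hypothetical: the model, its weights, the all-zero baseline, and the function names are assumptions for illustration, and the paper applies the method to language models rather than to a three-feature classifier.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def model(x: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Toy logistic model: probability of the positive class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def integrated_gradients(
    x: Sequence[float],
    baseline: Sequence[float],
    weights: Sequence[float],
    bias: float,
    steps: int = 200,
) -> List[float]:
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    grad_sums = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight-line path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        s = model(point, weights, bias)
        # Analytic gradient of the sigmoid output with respect to each input.
        for i, w in enumerate(weights):
            grad_sums[i] += s * (1.0 - s) * w
    # Scale the average gradient by the input's distance from the baseline.
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sums)]


# Hypothetical example: the second feature has zero weight, so it receives
# zero attribution no matter how large its value is.
weights, bias = [2.0, 0.0, -1.0], 0.1
x, baseline = [1.0, 3.0, 0.5], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, weights, bias)
print(attributions)
```

The attributions sum approximately to the difference between the model's output at the input and at the baseline, so a large share landing on a feature that should be irrelevant is one sign of a spurious decision rule.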
Entities
Institutions
- arXiv