SAEmnesia: Supervised Sparse Autoencoders for Concept Unlearning in Diffusion Models

ai-technology · 2026-06-01

Researchers have introduced SAEmnesia, a supervised sparse autoencoder framework for concept unlearning in diffusion models that enforces one-to-one mappings between concepts and neurons. The approach targets feature splitting, in which a single concept is scattered across many latent features and is therefore hard to remove. By labeling concepts during training, SAEmnesia centralizes features so that each concept is tied to a single interpretable neuron. Compared with state-of-the-art sparse autoencoder unlearning methods, it reduces hyperparameter search by 96.67% and improves results on the UnlearnCanvas benchmark for objects by 9.22%. It also scales to sequential unlearning, improving accuracy by 28.4% when nine objects are removed.
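To make the idea of a one-to-one concept-neuron mapping more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the loss form (reconstruction plus L1 sparsity plus a cross-entropy term over the first n_concepts latents), the function names, and parameters such as l1_coef, sup_coef, and scale are all assumptions chosen for illustration. The sketch shows how a supervised term could push concept i onto latent neuron i, and how unlearning could then reduce to rescaling that single neuron.

```python
import numpy as np

def encode(x, W_enc, b_enc):
    """ReLU encoder of a sparse autoencoder."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(z, W_dec, b_dec):
    """Linear decoder mapping latents back to activation space."""
    return z @ W_dec + b_dec

def supervised_sae_loss(x, labels, W_enc, b_enc, W_dec, b_dec,
                        n_concepts, l1_coef=1e-3, sup_coef=1.0):
    """Hypothetical supervised SAE objective (an assumption, not the paper's loss).

    labels[i] is the integer concept index of sample i; the supervised term
    treats the first n_concepts latents as logits, so concept k is encouraged
    to fire on neuron k.
    """
    z = encode(x, W_enc, b_enc)
    x_hat = decode(z, W_dec, b_dec)
    recon = np.mean((x - x_hat) ** 2)   # reconstruction error
    sparsity = np.mean(np.abs(z))       # L1 sparsity penalty
    # Numerically stable log-softmax over the concept neurons.
    logits = z[:, :n_concepts]
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    supervised = -np.mean(log_probs[np.arange(len(labels)), labels])
    return recon + l1_coef * sparsity + sup_coef * supervised

def ablate_concept(x, concept_idx, W_enc, b_enc, W_dec, b_dec, scale=0.0):
    """Rescale the single neuron assigned to a concept, then decode.

    scale=0.0 zeroes the neuron; other values are a hypothetical knob.
    """
    z = encode(x, W_enc, b_enc)
    z[:, concept_idx] *= scale
    return decode(z, W_dec, b_dec)
```

With one neuron per concept, removing a concept in this sketch touches only one latent index, which is one plausible reading of why a centralized representation would need less hyperparameter search than a representation where the concept is split across many features.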

Key facts

SAEmnesia is a supervised sparse autoencoder framework for concept unlearning in diffusion models.
It enforces one-to-one concept-neuron mappings to overcome feature splitting.
The method reduces hyperparameter search by 96.67% compared with state-of-the-art sparse autoencoder-based unlearning methods.
SAEmnesia achieves a 9.22% improvement on the UnlearnCanvas benchmark for objects.
It improves accuracy by 28.4% when sequentially removing nine objects.
The framework enables highly targeted and efficient concept erasure.
Labeling concepts during training centralizes features, tying each concept to a single interpretable neuron.
The work is published on arXiv under identifier 2509.21379.
