ARTFEED — Contemporary Art Intelligence

SafeRedir: A Lightweight Framework for Robust Unlearning in Image Generation Models

ai-technology · 2026-05-07

Researchers have introduced SafeRedir, a lightweight inference-time framework designed to erase harmful concepts from image generation models (IGMs) without costly retraining. IGMs often memorize undesirable content from training data, such as NSFW imagery and copyrighted artistic styles, posing safety and compliance risks. Post-hoc filtering methods lack robustness and fine-grained semantic control, while existing unlearning methods require retraining, degrade generation quality, or fail against paraphrased and adversarial prompts.

SafeRedir instead operates via prompt embedding redirection: it modifies the model's behavior at inference time to prevent the reproduction of unsafe content while preserving benign generation quality. Because the framework never alters the underlying model weights, it remains efficient and adaptable, addressing the need for robust, scalable unlearning in real-world deployments of generative AI.
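To make the idea of prompt embedding redirection concrete, here is a minimal sketch of the general pattern: intercept the prompt embedding before it reaches the generator, compare it against embeddings of unsafe concepts, and redirect it toward a safe anchor when it is too similar. The function names, the cosine-similarity threshold, and the redirection rule below are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def redirect_embedding(prompt_emb: np.ndarray,
                       unsafe_anchors: list[np.ndarray],
                       safe_anchor: np.ndarray,
                       threshold: float = 0.7) -> np.ndarray:
    """Inference-time redirection sketch: if the prompt embedding is too
    similar to any unsafe concept anchor, steer it onto the safe anchor's
    direction (keeping the original norm); otherwise pass it through.
    No model weights are touched -- only the embedding fed to the model."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    for anchor in unsafe_anchors:
        if cos(prompt_emb, anchor) >= threshold:
            # Redirect: replace the embedding's direction with the safe
            # anchor's direction, preserving the embedding's magnitude.
            unit_safe = safe_anchor / np.linalg.norm(safe_anchor)
            return unit_safe * np.linalg.norm(prompt_emb)
    # Benign prompts are returned unchanged, preserving generation quality.
    return prompt_emb
```

In a real pipeline, the anchors would be text-encoder embeddings of concept phrases, and the redirected embedding would be passed to the generator in place of the original, leaving the model itself untouched.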

Key facts

  • SafeRedir is a lightweight inference-time framework for unlearning in image generation models.
  • It uses prompt embedding redirection to erase harmful concepts without retraining.
  • IGMs often memorize NSFW imagery and copyrighted styles from training data.
  • Post-hoc filtering is not robust and lacks fine-grained semantic control.
  • Existing unlearning methods require costly retraining or degrade generation quality.
  • SafeRedir does not modify model weights, making it efficient and adaptable while preserving benign generation quality.
  • The framework is designed to withstand prompt paraphrasing and adversarial attacks.
  • SafeRedir addresses safety and compliance risks in real-world deployments.