BEAP Attack Exploits Unlearned Concepts in Text-to-Image Diffusion Models
Researchers have developed BEAP, a black-box adversarial prompting attack that targets weaknesses in machine unlearning for text-to-image diffusion models. Unlike previous methods, which either required access to model weights or produced nonsensical prompts, BEAP uses a large language model (LLM) to generate adversarial prompts iteratively. It performs an embedding-aware search in text space, guided by reward signals for the presence of the unlearned concept, text-image alignment, and image quality. The attack targets models that have undergone unlearning to remove specific concepts, and it exposes vulnerabilities that remain after unlearning. The findings are detailed in a paper available on arXiv with the identifier 2605.26332.
Key facts
- BEAP is a black-box, embedding-aware adversarial prompting attack.
- It leverages a large language model (LLM) to generate adversarial prompts.
- The attack combines three reward signals: presence of the unlearned concept, text-image alignment, and image quality.
- Prior attacks either require model weights or produce detectable gibberish prompts.
- The paper is on arXiv with ID 2605.26332.
- Machine unlearning aims to remove specific concepts from pretrained models.
- BEAP exploits vulnerabilities in unlearned text-to-image diffusion models.
- The attack performs an embedding-aware search in text space.
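The key facts above describe an iterative, reward-guided prompt search. The following is a minimal Python sketch of that general idea, not the paper's implementation. Everything in it is an assumption made for illustration: the weighted-sum form of the reward, the weights, the greedy acceptance rule, and the names `combined_reward`, `iterative_search`, `toy_propose`, and `toy_score`. The toy functions stand in for the LLM proposer and for the image-based scorers, and the embedding-aware part of the search is not modeled here.

```python
from typing import Callable, List, Tuple


def combined_reward(concept: float, alignment: float, quality: float,
                    w_concept: float = 1.0, w_alignment: float = 0.5,
                    w_quality: float = 0.5) -> float:
    """Weighted sum of the three signals (weights are illustrative assumptions)."""
    return w_concept * concept + w_alignment * alignment + w_quality * quality


def iterative_search(seed: str,
                     propose: Callable[[str, float], List[str]],
                     score: Callable[[str], float],
                     rounds: int = 3) -> Tuple[str, float]:
    """Greedy loop: ask the proposer for candidates, keep the best-scoring prompt."""
    best_prompt, best_reward = seed, score(seed)
    for _ in range(rounds):
        # In a real attack the proposer would be an LLM that sees the
        # current best prompt and its reward as feedback.
        for candidate in propose(best_prompt, best_reward):
            reward = score(candidate)
            if reward > best_reward:
                best_prompt, best_reward = candidate, reward
    return best_prompt, best_reward


def toy_propose(prompt: str, reward: float) -> List[str]:
    """Hypothetical stand-in for the LLM: append one word per candidate."""
    return [f"{prompt} {word}" for word in ("sunlit", "portrait", "detailed")]


def toy_score(prompt: str) -> float:
    """Hypothetical stand-in for scoring the image generated from a prompt."""
    words = prompt.split()
    concept = 1.0 if "portrait" in words else 0.0   # pretend concept detector
    alignment = min(len(set(words)) / 5.0, 1.0)     # pretend text-image alignment
    quality = 1.0 if "detailed" in words else 0.0   # pretend image-quality score
    return combined_reward(concept, alignment, quality)


if __name__ == "__main__":
    prompt, reward = iterative_search("a painting", toy_propose, toy_score)
    print(prompt, reward)
```

In an actual black-box setting, `toy_score` would be replaced by a pipeline that queries the unlearned model with the prompt and scores the resulting image, and `toy_propose` by calls to an LLM. Only the loop structure carries over.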
Entities
Institutions
- arXiv