ARTFEED — Contemporary Art Intelligence

BEAP Attack Exploits Unlearned Concepts in Text-to-Image Diffusion Models

ai-technology · 2026-05-27

Researchers have developed BEAP, a black-box adversarial prompting attack that targets weaknesses in machine unlearning for text-to-image diffusion models. Earlier attacks either required access to model weights or produced nonsensical prompts. BEAP instead uses a large language model (LLM) to generate adversarial prompts iteratively, running an embedding-aware search in text space guided by three reward signals: presence of the unlearned concept, text-image alignment, and image quality. The attack targets models that have undergone unlearning to remove specific concepts, exposing vulnerabilities that remain afterward. The findings are detailed in a paper available on arXiv under the identifier 2605.26332.
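
To make the idea of combining reward signals concrete, here is a minimal sketch in Python. It assumes each signal has already been scored in the range 0 to 1 by some external model, and it assumes a simple weighted sum; the summary does not say how BEAP actually combines its signals, so the names RewardWeights and combined_reward and the default weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the source does not give BEAP's actual values.
    concept: float = 1.0
    alignment: float = 0.5
    quality: float = 0.5


def combined_reward(concept_score, alignment_score, quality_score,
                    weights=RewardWeights()):
    """Weighted sum of three scores, each expected to lie in [0, 1].

    concept_score:   how strongly the unlearned concept appears in the image
    alignment_score: how well the image matches the prompt text
    quality_score:   overall image quality
    The weighted-sum form is an assumption made for illustration.
    """
    scores = {
        "concept_score": concept_score,
        "alignment_score": alignment_score,
        "quality_score": quality_score,
    }
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (weights.concept * concept_score
            + weights.alignment * alignment_score
            + weights.quality * quality_score)
```

With these illustrative weights, a prompt that brings back the removed concept scores higher than one that only yields a well-aligned or high-quality image, which is the ordering an attacker would presumably want.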

Key facts

  • BEAP is a black-box, embedding-aware adversarial prompting attack.
  • It leverages a large language model (LLM) to generate adversarial prompts.
  • The attack combines reward signals: unlearned concept presence, text-image alignment, and image quality.
  • Prior attacks either require model weights or produce detectable gibberish prompts.
  • The paper is on arXiv with ID 2605.26332.
  • Machine unlearning aims to remove specific concepts from pretrained models.
  • BEAP exploits vulnerabilities in unlearned text-to-image diffusion models.
  • The attack performs an embedding-aware search in text space.
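
The key facts above describe an iterative, black-box search over prompts. The following Python sketch shows one way such a loop could look; it is not the paper's algorithm. It assumes the attacker supplies three callables: propose (a stand-in for the LLM that rewrites a prompt), embed (a text encoder), and reward (a black-box score obtained by querying the target model). The function names, the beam-style selection, and the cosine-similarity bonus are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_prompt(seed, propose, embed, reward, target_embedding,
                  rounds=5, beam=3, sim_weight=0.5):
    """Iteratively refine a prompt using only black-box feedback.

    propose(prompt) -> list of candidate rewrites (stand-in for an LLM call)
    embed(prompt)   -> vector for the prompt (stand-in for a text encoder)
    reward(prompt)  -> scalar score from querying the target model
    The similarity bonus toward target_embedding is an assumed way to make
    the search "embedding-aware"; the source does not specify the mechanism.
    """
    def score(prompt):
        return reward(prompt) + sim_weight * cosine(embed(prompt), target_embedding)

    frontier = [seed]
    best, best_score = seed, score(seed)
    for _ in range(rounds):
        # Collect rewrites of every prompt in the frontier, dropping duplicates
        # while preserving order so the result is deterministic.
        candidates = list(dict.fromkeys(c for p in frontier for c in propose(p)))
        if not candidates:
            break
        ranked = sorted(candidates, key=score, reverse=True)
        frontier = ranked[:beam]
        top_score = score(frontier[0])
        if top_score > best_score:
            best, best_score = frontier[0], top_score
    return best, best_score
```

Because the loop touches the target model only through reward, it needs no access to model weights, which matches the black-box setting described above. In practice each call to reward would involve generating and scoring images, so the number of rounds and the beam width would trade off attack strength against query cost.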

Entities

Institutions

  • arXiv
