CLIP Embeddings Drive Memorization in Stable Diffusion
A new arXiv paper reveals that memorization in Stable Diffusion is unexpectedly driven by CLIP embeddings. The researchers categorize a prompt's input tokens into four groups: start-of-text, prompt-related, end-of-text, and padding. Because padding embeddings are structural duplicates of the end-of-text embedding, they amplify its influence; the model comes to over-rely on the end-of-text embedding, and this over-reliance drives memorization. In memorized cases, the prompt-related embeddings themselves contribute only minimally.
Key facts
- Memorization in Stable Diffusion is driven by CLIP embeddings.
- Input tokens are categorized as start-of-text, prompt-related, end-of-text, and padding.
- Padding embeddings duplicate end-of-text embeddings structurally.
- This duplication amplifies the influence of end-of-text embeddings.
- Prompt-related embeddings contribute minimally to memorized cases.
- The paper is from arXiv:2605.02908.
- The research focuses on text-to-image diffusion models.
- The findings have implications for interpretability and safety.
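The token layout described above can be sketched in a few lines of Python. The start-of-text and end-of-text token IDs (49406 and 49407) and the 77-token context length match CLIP's tokenizer as used in Stable Diffusion, where the padding token is the same as the end-of-text token; the example prompt token IDs are illustrative, not real vocabulary entries.

```python
SOT_ID = 49406   # start-of-text token in CLIP's BPE vocabulary
EOT_ID = 49407   # end-of-text token; reused as the padding token
MAX_LEN = 77     # CLIP context length used by Stable Diffusion

def categorize_tokens(prompt_token_ids):
    """Pad a prompt's token IDs to MAX_LEN and label each position
    with the four categories used in the paper."""
    seq = [SOT_ID] + list(prompt_token_ids) + [EOT_ID]
    # Padding positions structurally duplicate the end-of-text token.
    seq += [EOT_ID] * (MAX_LEN - len(seq))
    labels = (["start-of-text"]
              + ["prompt"] * len(prompt_token_ids)
              + ["end-of-text"]
              + ["padding"] * (MAX_LEN - len(prompt_token_ids) - 2))
    return list(zip(seq, labels))

# Illustrative 5-token prompt, e.g. "a photo of a cat"
tokens = categorize_tokens([320, 1125, 539, 320, 2368])
```

Because every padding position carries the same token ID as the end-of-text position, the encoder sees that embedding repeated dozens of times, which is the structural duplication the paper identifies as amplifying its influence.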