CLIP Embeddings Drive Memorization in Stable Diffusion
A new arXiv paper reveals that memorization in Stable Diffusion is unexpectedly driven by CLIP embeddings. The researchers categorize a prompt's input tokens into four groups: start-of-text, prompt-related, end-of-text, and padding. Because padding embeddings are structural duplicates of the end-of-text embedding, they amplify its influence; the model comes to over-rely on the end-of-text embedding, and this over-reliance drives memorization. In memorized cases, the prompt-related embeddings themselves contribute only minimally.
Key facts
- Memorization in Stable Diffusion is driven by CLIP embeddings.
- Input tokens are categorized as start-of-text, prompt-related, end-of-text, and padding.
- Padding embeddings duplicate end-of-text embeddings structurally.
- This duplication amplifies the influence of end-of-text embeddings.
- Prompt-related embeddings contribute minimally to memorized cases.
- The paper is from arXiv:2605.02908.
- The research focuses on text-to-image diffusion models.
- The findings have implications for interpretability and safety.
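The token layout described above can be sketched in a few lines of Python. The start-of-text and end-of-text token IDs (49406 and 49407) and the 77-token context length match CLIP's tokenizer as used in Stable Diffusion, where the padding token is the same as the end-of-text token; the example prompt token IDs are illustrative, not real vocabulary entries.

```python
SOT_ID = 49406   # start-of-text token in CLIP's BPE vocabulary
EOT_ID = 49407   # end-of-text token; reused as the padding token
MAX_LEN = 77     # CLIP context length used by Stable Diffusion

def categorize_tokens(prompt_token_ids):
    """Pad a prompt's token IDs to MAX_LEN and label each position
    with the four categories used in the paper."""
    seq = [SOT_ID] + list(prompt_token_ids) + [EOT_ID]
    # Padding positions structurally duplicate the end-of-text token.
    seq += [EOT_ID] * (MAX_LEN - len(seq))
    labels = (["start-of-text"]
              + ["prompt"] * len(prompt_token_ids)
              + ["end-of-text"]
              + ["padding"] * (MAX_LEN - len(prompt_token_ids) - 2))
    return list(zip(seq, labels))

# Illustrative 5-token prompt, e.g. "a photo of a cat"
tokens = categorize_tokens([320, 1125, 539, 320, 2368])
```

Because every padding position carries the same token ID as the end-of-text position, the encoder sees that embedding repeated dozens of times, which is the structural duplication the paper identifies as amplifying its influence.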