Massive Activations in Diffusion Transformers Reveal How Prompts Shape Images
A new study on arXiv (2605.13974) finds that in Diffusion Transformers (DiTs) and flow-based architectures, a small subset of hidden-state channels, termed 'massive activations', carries a disproportionate share of the generative computation. Despite their sparsity, these channels are functionally critical: zeroing them causes a sharp collapse in generation quality, while disrupting low-statistic channels has only a marginal effect. The massive activations are also spatially organized: image-stream tokens cluster into coherent partitions that align with main subjects and salient regions, exposing structured spatial layouts. The findings shed light on the internal mechanisms of text-to-image generation.
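To make the "consistently larger responses" criterion concrete, a simple magnitude test over one block's hidden states can flag candidate channels. This is a minimal sketch, not the paper's exact procedure: the (tokens, channels) layout, the 50x-median threshold, and the function name are all illustrative assumptions.

```python
import torch

def find_massive_channels(hidden: torch.Tensor, ratio: float = 50.0) -> torch.Tensor:
    """Flag channels whose peak magnitude dwarfs the typical channel.

    hidden: (num_tokens, num_channels) hidden states captured from one DiT block.
    ratio:  hypothetical threshold; the paper's exact criterion may differ.
    Returns a boolean mask over channels.
    """
    peak = hidden.abs().amax(dim=0)        # peak |activation| per channel
    return peak > ratio * peak.median()    # "massive" = far above the median peak

# Demo with random states plus one injected outlier channel (illustration only):
hidden_states = torch.randn(256, 1152)
hidden_states[:, 42] *= 500.0
mask = find_massive_channels(hidden_states)
print(f"{int(mask.sum())} of {mask.numel()} channels flagged")  # expect 1
```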
Key facts
- Study focuses on Diffusion Transformers (DiTs) and flow-based architectures
- Massive activations are a small subset of hidden-state channels with consistently larger responses than typical channels
- Zeroing the massive channels causes a sharp collapse in generation quality (a minimal ablation sketch follows this list)
- Disrupting low-statistic channels has only a marginal effect
- Massive channels are spatially organized
- Image-stream tokens cluster into coherent partitions aligning with main subjects and salient regions
- Research exposes structured spatial layouts in DiTs
- Paper available on arXiv with ID 2605.13974
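The zero-ablation intervention described above can be reproduced in spirit with a PyTorch forward hook that zeroes the flagged channels in a block's output. This is a hedged sketch: the pipeline object `pipe`, the `transformer_blocks[10]` path (a common diffusers-style DiT layout), and the prompt are assumptions, not details from the paper.

```python
import torch

def zero_channels_hook(mask: torch.Tensor):
    """Build a forward hook that zeroes the masked channels of a block's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[..., mask] = 0.0  # ablate the flagged (massive) channels
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return hook

# Hypothetical usage with a diffusers-style DiT pipeline (`pipe` and the block
# index are assumptions; adapt to the actual model under study):
# handle = pipe.transformer.transformer_blocks[10].register_forward_hook(
#     zero_channels_hook(mask))
# image = pipe("a photo of a red fox").images[0]  # expect degraded output per the paper
# handle.remove()
```

Zeroing low-statistic channels instead (inverting the mask) would give the control condition the study contrasts against.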