Massive Activations in Diffusion Transformers Reveal How Prompts Shape Images
A new study on arXiv (2605.13974) finds that in Diffusion Transformers (DiTs) and flow-based architectures, a small subset of hidden-state channels, termed 'massive activations', carries a disproportionate share of the generative computation. Despite their sparsity, these channels are functionally critical: zeroing them causes a sharp collapse in generation quality, while disrupting low-statistic channels has only a marginal effect. The massive activations are also spatially organized: image-stream tokens cluster into coherent partitions that align with main subjects and salient regions, exposing structured spatial layouts. The findings shed light on the internal mechanisms of text-to-image generation.
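To make the "consistently larger responses" criterion concrete, a simple magnitude test over one block's hidden states can flag candidate channels. This is a minimal sketch, not the paper's exact procedure: the (tokens, channels) layout, the 50x-median threshold, and the function name are all illustrative assumptions.

```python
import torch

def find_massive_channels(hidden: torch.Tensor, ratio: float = 50.0) -> torch.Tensor:
    """Flag channels whose peak magnitude dwarfs the typical channel.

    hidden: (num_tokens, num_channels) hidden states captured from one DiT block.
    ratio:  hypothetical threshold; the paper's exact criterion may differ.
    Returns a boolean mask over channels.
    """
    peak = hidden.abs().amax(dim=0)        # peak |activation| per channel
    return peak > ratio * peak.median()    # "massive" = far above the median peak

# Demo with random states plus one injected outlier channel (illustration only):
hidden_states = torch.randn(256, 1152)
hidden_states[:, 42] *= 500.0
mask = find_massive_channels(hidden_states)
print(f"{int(mask.sum())} of {mask.numel()} channels flagged")  # expect 1
```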
Key facts
- Study focuses on Diffusion Transformers (DiTs) and flow-based architectures
- Massive activations are a small subset of hidden-state channels with consistently larger responses than typical channels
- Zeroing the massive channels causes a sharp collapse in generation quality (a minimal ablation sketch follows this list)
- Disrupting low-statistic channels has only a marginal effect
- Massive channels are spatially organized
- Image-stream tokens cluster into coherent partitions aligning with main subjects and salient regions
- Research exposes structured spatial layouts in DiTs
- Paper available on arXiv with ID 2605.13974
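The zero-ablation intervention described above can be reproduced in spirit with a PyTorch forward hook that zeroes the flagged channels in a block's output. This is a hedged sketch: the pipeline object `pipe`, the `transformer_blocks[10]` path (a common diffusers-style DiT layout), and the prompt are assumptions, not details from the paper.

```python
import torch

def zero_channels_hook(mask: torch.Tensor):
    """Build a forward hook that zeroes the masked channels of a block's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[..., mask] = 0.0  # ablate the flagged (massive) channels
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return hook

# Hypothetical usage with a diffusers-style DiT pipeline (`pipe` and the block
# index are assumptions; adapt to the actual model under study):
# handle = pipe.transformer.transformer_blocks[10].register_forward_hook(
#     zero_channels_hook(mask))
# image = pipe("a photo of a red fox").images[0]  # expect degraded output per the paper
# handle.remove()
```

Zeroing low-statistic channels instead (inverting the mask) would give the control condition the study contrasts against.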