Dual-Stage Registers Tackle Outlier Tokens in Diffusion Transformers
A new arXiv preprint investigates outlier tokens in Diffusion Transformers (DiTs) for image generation. The researchers find that both the encoder and the denoiser in modern Representation Autoencoder (RAE)-DiT pipelines produce outlier tokens: high-norm tokens that attract excessive attention while carrying little local information. The phenomenon was previously observed in Vision Transformers (ViTs) but has been underexplored in generative models. Notably, simply masking high-norm tokens does not improve performance, which suggests the problem lies in corrupted local patch semantics rather than in the extreme values themselves. To address this, the team proposes Dual-Stage Registers (DSR), a register-based intervention applied to both components, in which trained registers mitigate the outlier tokens and improve model performance. The paper is available on arXiv under ID 2605.05206.
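The register mechanism the paper builds on can be sketched generically. The following is a minimal PyTorch toy, assuming the standard ViT-style register design (a few learnable tokens appended to the patch sequence so attention has extra slots to dump global information into, then discarded at the output); the paper's actual DSR architecture is not public here, so the class name, dimensions, and block layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RegisterBlock(nn.Module):
    """Toy attention block with learnable register tokens.

    Hypothetical sketch of the generic register mechanism: extra
    learnable tokens are appended to the sequence before attention
    (giving high-norm "sink" behavior somewhere to live) and are
    dropped before the block's output. Not the paper's DSR.
    """

    def __init__(self, dim: int = 64, num_registers: int = 4, num_heads: int = 4):
        super().__init__()
        # One shared set of trainable register tokens, broadcast per batch.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        # Append registers so attention has slots besides the patch tokens.
        reg = self.registers.expand(b, -1, -1)
        h = torch.cat([x, reg], dim=1)
        h = h + self.attn(self.norm(h), self.norm(h), self.norm(h))[0]
        # Drop the registers: only the original patch tokens are returned.
        return h[:, :n, :]

tokens = torch.randn(2, 16, 64)   # (batch, patches, dim)
out = RegisterBlock()(tokens)
print(out.shape)                  # torch.Size([2, 16, 64])
```

The design choice that matters is that the registers participate in attention but never reach the output, so downstream patch semantics are untouched.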
Key facts
- Outlier tokens appear in both encoder and denoiser of RAE-DiT pipelines.
- Masking high-norm tokens does not improve performance.
- Problem is linked to corrupted local patch semantics.
- Dual-Stage Registers (DSR) is proposed as a solution.
- DSR is a register-based intervention for both encoder and denoiser.
- Research is from arXiv preprint 2605.05206.
- Prior work identified outlier tokens in Vision Transformers.
- Study focuses on image generation using Diffusion Transformers.
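The "masking high-norm tokens" baseline in the key facts presupposes a way to flag such tokens. A common heuristic, assumed here for illustration since the paper's exact criterion is not given, is a z-score on per-token L2 norms:

```python
import numpy as np

def flag_outlier_tokens(tokens: np.ndarray, z_thresh: float = 3.0) -> np.ndarray:
    """Flag tokens whose L2 norm is a z-score outlier.

    Hypothetical heuristic for finding the high-norm tokens described
    in the key facts; the paper's actual detection rule may differ.
    """
    norms = np.linalg.norm(tokens, axis=-1)            # (num_tokens,)
    z = (norms - norms.mean()) / (norms.std() + 1e-8)  # standardize norms
    return z > z_thresh                                # boolean outlier mask

rng = np.random.default_rng(0)
toks = rng.normal(size=(64, 32))   # 64 tokens of dim 32
toks[5] *= 10.0                    # inject one artificially high-norm token
mask = flag_outlier_tokens(toks)
print(int(mask.sum()))             # 1 (only the injected token is flagged)
```

Per the study, zeroing or dropping the tokens this mask finds does not recover performance, which is the evidence that the damage sits in the patch semantics rather than in the norms alone.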
Entities
Institutions
- arXiv