Residualized Temporal SAEs for Interpreting Diffusion Models
Residualized temporal sparse autoencoders (SAEs) are a new method for interpreting text-to-image diffusion models. Unlike standard SAEs that analyze activations at individual timesteps, this approach collects activations across the entire denoising trajectory, fits linear predictors between neighboring timesteps, and represents each trajectory as an initial activation plus the residual components that the linear dynamics do not explain. An SAE trained on this residualized representation can then capture structure beyond what is linearly predictable from one timestep to the next. The paper is available on arXiv under ID 2605.27813.
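The residualization step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the ridge-regularized least-squares fit, the bias term, the array layout (samples, timesteps, features), the function names, and the synthetic trajectories are all hypothetical choices made for the example.

```python
import numpy as np

def fit_linear_predictors(acts, ridge=1e-3):
    """Fit one affine map per pair of neighboring timesteps.

    acts has shape (N, T, D): N trajectories, T timesteps, D features.
    Returns a list of (W, b) with acts[:, t + 1] ~= acts[:, t] @ W + b.
    The ridge penalty is an assumption added for numerical stability.
    """
    n, t_steps, d = acts.shape
    predictors = []
    for t in range(t_steps - 1):
        x = np.hstack([acts[:, t], np.ones((n, 1))])   # append a bias column
        y = acts[:, t + 1]
        gram = x.T @ x + ridge * np.eye(d + 1)
        coef = np.linalg.solve(gram, x.T @ y)          # shape (D + 1, D)
        predictors.append((coef[:-1], coef[-1]))
    return predictors

def residualize(acts, predictors):
    """Slot 0 keeps the initial activation; slot t > 0 holds the residual."""
    out = np.empty_like(acts)
    out[:, 0] = acts[:, 0]
    for t, (w, b) in enumerate(predictors):
        out[:, t + 1] = acts[:, t + 1] - (acts[:, t] @ w + b)
    return out

def reconstruct(resid, predictors):
    """Invert residualize: roll the linear dynamics forward, adding residuals."""
    acts = np.empty_like(resid)
    acts[:, 0] = resid[:, 0]
    for t, (w, b) in enumerate(predictors):
        acts[:, t + 1] = acts[:, t] @ w + b + resid[:, t + 1]
    return acts

# Synthetic stand-in for diffusion activations: near-linear dynamics plus noise.
rng = np.random.default_rng(0)
n, t_steps, d = 64, 5, 8
dynamics = rng.normal(size=(d, d)) / np.sqrt(d)
acts = np.empty((n, t_steps, d))
acts[:, 0] = rng.normal(size=(n, d))
for t in range(t_steps - 1):
    acts[:, t + 1] = acts[:, t] @ dynamics + 0.01 * rng.normal(size=(n, d))

predictors = fit_linear_predictors(acts)
resid = residualize(acts, predictors)

# One possible SAE input: each trajectory flattened to a single vector.
sae_inputs = resid.reshape(n, -1)
```

Because each residual is defined relative to the fitted predictor, the representation is lossless: `reconstruct(resid, predictors)` recovers the original activations. Flattening each trajectory into one vector is only one way to feed the result to an SAE; the summary does not say how the paper arranges the input.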
Key facts
- Method introduced: residualized temporal sparse autoencoders for diffusion activation trajectories.
- Activations are collected across denoising time.
- Linear predictors are fit between neighboring timesteps.
- Each trajectory is represented by an initial activation and residual components.
- Residual components capture structure that is not linearly predictable from the neighboring timestep.
- Aimed at interpreting text-to-image diffusion models.
- Paper published on arXiv: 2605.27813.
- Announce type: cross.
Entities
Institutions
- arXiv