Training-Free Method Corrects Drift in Multi-Turn Image Editing with Diffusion Transformers
Researchers have identified a major cause of quality degradation in multi-turn image editing with diffusion transformers (DiTs). Their frequency analysis of the latent space shows that the DiT introduces a dominant low-frequency drift that accumulates across editing rounds and leads to semantic misalignment, whereas the VAE contributes only a comparatively stable reconstruction bias. To address this, they propose VAE-LFA (Low Frequency Alignment), a training-free, plug-and-play method. It decomposes latent discrepancies with low-pass filtering and aligns low-frequency statistics in the VAE latent space to an exponential moving average from prior rounds. Further details are available in arXiv:2605.08250.
Key facts
- Diffusion transformers (DiTs) enable single-turn image editing but suffer from progressive semantic drift in multi-turn editing.
- The drift is caused by dominant low-frequency components introduced by the DiT in the VAE latent space.
- The VAE contributes a comparatively stable reconstruction bias.
- VAE-LFA is a training-free, plug-and-play method for low-frequency alignment.
- It decomposes latent discrepancies via low-pass filtering and aligns low-frequency statistics to an exponential moving average.
- The research is published on arXiv with ID 2605.08250.
- The method operates in VAE latent space.
- The study uses a frequency perspective to analyze the editing process.
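The low-frequency alignment idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the box-blur low-pass filter, the choice of mean and standard deviation as the aligned "statistics", the decay value, the rule for updating the moving average, and the names `low_pass`, `stats`, and `LowFreqAligner` are all assumptions made for illustration. A latent is modeled as a single-channel 2D grid of floats.

```python
from statistics import fmean, pstdev


def low_pass(grid, radius=1):
    """Box-blur a 2D grid (list of rows); edge cells average in-bounds neighbours only."""
    h, w = len(grid), len(grid[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                grid[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            row.append(fmean(window))
        out.append(row)
    return out


def stats(grid):
    """Return (mean, population std) over all cells of a 2D grid."""
    flat = [v for row in grid for v in row]
    return fmean(flat), pstdev(flat)


class LowFreqAligner:
    """Hypothetical helper: keeps an EMA of low-band statistics across editing rounds."""

    def __init__(self, decay=0.9, radius=1, eps=1e-8):
        self.decay, self.radius, self.eps = decay, radius, eps
        self.ema_mean = None
        self.ema_std = None

    def __call__(self, latent):
        # Split the latent into a low-frequency band and a high-frequency residual.
        low = low_pass(latent, self.radius)
        high = [[z - l for z, l in zip(zr, lr)] for zr, lr in zip(latent, low)]
        mean, std = stats(low)

        if self.ema_mean is None:
            # First round: nothing to align to yet, so only record the statistics.
            self.ema_mean, self.ema_std = mean, std
            return latent

        # Re-standardise the low band to the EMA target, then add the residual back.
        scale = self.ema_std / (std + self.eps)
        aligned = [
            [(l - mean) * scale + self.ema_mean + r for l, r in zip(lr, hr)]
            for lr, hr in zip(low, high)
        ]

        # Assumption: the EMA tracks the statistics observed before alignment.
        d = self.decay
        self.ema_mean = d * self.ema_mean + (1 - d) * mean
        self.ema_std = d * self.ema_std + (1 - d) * std
        return aligned
```

In this sketch, one `LowFreqAligner` instance would be kept per editing session and called on the latent after each round. A uniform offset added to a latent lies entirely in the low band, so it is removed while the high-frequency residual passes through unchanged. A real pipeline would operate on multi-channel tensors and would likely use a different filter.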
Entities
Institutions
- arXiv