ARTFEED — Contemporary Art Intelligence

FreqFormer: Efficient Video Diffusion via Adaptive Spectral Attention

ai-technology · 2026-04-29

FreqFormer is a frequency-aware heterogeneous attention framework that tackles the quadratic self-attention cost of long-sequence video diffusion transformers. It splits token features into spectral bands and applies a different operator to each: dense global attention on compressed low frequencies to capture layout and coarse motion, structured block-sparse attention on mid frequencies, and sliding-window local attention on high frequencies for texture and detail. A lightweight spectral routing network allocates attention heads across bands based on layer statistics and the diffusion timestep, shifting compute from global structure early in denoising toward fine detail later. Cross-band summary tokens provide a cheap residual exchange between bands. The method targets runtime and memory efficiency for very long token sequences.
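The efficiency claim can be illustrated with a back-of-the-envelope count of attention pairs per band versus full dense self-attention. The band sizes, block size, and window size below are illustrative assumptions, not figures from the paper:

```python
# Hypothetical sketch: attention-pair counts for three spectral bands
# (dense / block-sparse / sliding-window) versus one dense pass over the
# full sequence. All sizes here are made-up, not from the paper.

def dense_pairs(n):
    # dense global attention: every token attends to every token
    return n * n

def block_sparse_pairs(n, block):
    # structured block-sparse attention: here, tokens attend only
    # within their own block (a simple instance of block sparsity)
    full_blocks, rem = divmod(n, block)
    return full_blocks * block * block + rem * rem

def sliding_window_pairs(n, window):
    # sliding-window local attention: each token attends to at most
    # `window` neighbours on each side, plus itself
    return sum(min(n, i + window + 1) - max(0, i - window)
               for i in range(n))

def freqformer_pairs(n_low, n_mid, n_high, block, window):
    # heterogeneous cost: dense on the compressed low band,
    # block-sparse on the mid band, local on the high band
    return (dense_pairs(n_low)
            + block_sparse_pairs(n_mid, block)
            + sliding_window_pairs(n_high, window))

n = 4096  # total token sequence length (illustrative)
hetero = freqformer_pairs(n_low=256, n_mid=1024, n_high=2816,
                          block=64, window=32)
full = dense_pairs(n)
print(hetero, full, round(full / hetero, 1))
```

With these assumptions the heterogeneous scheme touches roughly 50x fewer attention pairs than a dense pass, which is the kind of saving the banded design is after; the real model's savings depend on how tokens are actually split across bands.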

Key facts

  • FreqFormer is a frequency-aware heterogeneous attention framework.
  • It splits token features into spectral bands with different operators.
  • Low frequencies get dense global attention; mid frequencies get structured block-sparse attention; high frequencies get sliding-window local attention.
  • A lightweight spectral routing network allocates heads across bands using layer statistics and diffusion timestep.
  • Compute shifts toward global structure early in denoising and detail later.
  • Cross-band summary tokens provide cheap residual exchange.
  • The method targets long-sequence video diffusion transformers.
  • It aims to reduce quadratic self-attention cost in runtime and memory.
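The timestep-dependent routing in the list above can be sketched as a softmax allocation of a fixed head budget across the three bands. The logit shapes and head count here are illustrative assumptions; the paper's router also conditions on layer statistics, which this toy version omits:

```python
import math

def route_heads(t, n_heads=16, sharpness=4.0):
    """Hypothetical spectral-router sketch: allocate attention heads across
    (low, mid, high) bands given a normalized diffusion timestep t in [0, 1],
    where t = 1 is the start of denoising (most noise). Early steps favour
    the global low-frequency band; late steps favour the local
    high-frequency band. All shapes and constants are illustrative."""
    logits = [sharpness * t,           # low band: global structure early
              sharpness * 0.5,         # mid band: roughly constant share
              sharpness * (1.0 - t)]   # high band: texture/detail late
    z = [math.exp(l) for l in logits]
    total = sum(z)
    weights = [v / total for v in z]
    # round down to integer head counts, then give the remainder
    # to the band with the largest weight
    heads = [int(w * n_heads) for w in weights]
    while sum(heads) < n_heads:
        heads[weights.index(max(weights))] += 1
    return heads

print(route_heads(1.0))  # early denoising: most heads on the low band
print(route_heads(0.0))  # late denoising: most heads on the high band
```

The point of the sketch is the crossover: as t falls from 1 to 0, the head budget migrates from the dense global operator to the cheap local one, matching the article's description of compute shifting from structure to detail.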

Entities

Institutions

  • arXiv

Sources