Head Forcing Extends Autoregressive Video Generation to Minute-Length Videos
Head Forcing is a training-free framework that lets autoregressive video diffusion models generate minute-long videos without accumulating errors or losing context. The method, detailed in an arXiv preprint (2605.14487), challenges the uniform treatment of attention heads in current AR video diffusion transformers. The researchers found that attention heads play distinct roles: local heads refine fine-grained detail, anchor heads maintain structural integrity, and memory heads aggregate long-range context. Head Forcing assigns each head type its own KV cache strategy: local and anchor heads retain only essential tokens, while memory heads draw on a hierarchical memory with dynamic episodic updates. A complementary head-wise RoPE re-encoding scheme keeps positional encodings within the pretrained range. Together, these changes extend generation from 5 seconds to minute-length without any additional training, substantially improving long-horizon video synthesis.
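The preprint's implementation details are not given in this summary, but the following minimal sketch illustrates what role-specific KV cache policies could look like. Everything in it is an illustrative assumption rather than the paper's actual algorithm: the function name, the sliding-window rule for local heads, the top-k importance criterion for anchor heads, and the mean-pooled episodic summarization standing in for the hierarchical memory.

```python
import torch

def update_kv_cache(head_type, cache_k, cache_v, new_k, new_v,
                    scores=None, window=256, top_k=64,
                    budget=4096, block=512):
    """Append new keys/values for one head, then prune by head role.

    cache_k, cache_v, new_k, new_v: (tokens, dim) tensors.
    scores: per-token importance (e.g. accumulated attention weight),
    used only by anchor heads in this sketch.
    """
    k = torch.cat([cache_k, new_k], dim=0)
    v = torch.cat([cache_v, new_v], dim=0)

    if head_type == "local":
        # Local heads refine detail: a recent sliding window suffices.
        return k[-window:], v[-window:]

    if head_type == "anchor":
        # Anchor heads maintain structure: keep the highest-scoring
        # tokens, re-sorted back into temporal order.
        idx = torch.topk(scores, min(top_k, scores.numel())).indices
        idx = idx.sort().values
        return k[idx], v[idx]

    # Memory heads aggregate long-range context. As a stand-in for the
    # paper's hierarchical memory with dynamic episodic updates, we
    # mean-pool the oldest block into one episodic entry whenever the
    # cache exceeds its budget.
    if k.shape[0] > budget:
        k = torch.cat([k[:block].mean(0, keepdim=True), k[block:]], dim=0)
        v = torch.cat([v[:block].mean(0, keepdim=True), v[block:]], dim=0)
    return k, v
```

The sketch takes the head classification itself as given; in practice it would presumably come from profiling attention patterns on the pretrained model.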
Key facts
- Head Forcing is a training-free framework for autoregressive video diffusion models.
- It addresses error accumulation and context loss in long-horizon video generation.
- Attention heads are categorized as local, anchor, and memory heads with distinct roles.
- Local and anchor heads retain only essential tokens in KV cache.
- Memory heads use a hierarchical memory system with dynamic episodic updates.
- A head-wise RoPE re-encoding scheme keeps positional encodings within the pretrained range (see the sketch after this list).
- Generation length extends from 5 seconds to minute-length without additional training.
- The method is described in arXiv preprint 2605.14487.
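The summary only states that head-wise RoPE re-encoding keeps positions inside the pretrained range; the per-role remapping rules below are illustrative assumptions chosen to match each head's job, not the paper's actual scheme.

```python
import torch

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE rotation angles for integer positions -> (tokens, dim/2)."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions[:, None].float() * inv_freq[None, :]

def reencode_positions(head_type, positions, max_pos=1024):
    """Remap raw token positions into [0, max_pos) depending on head role."""
    if head_type == "local":
        # Local heads see a recent window: shift to relative offsets.
        return positions - positions.min()
    if head_type == "anchor":
        # Anchor heads attend to sparse retained tokens: re-index them
        # densely so gaps left by eviction don't inflate positions.
        return torch.arange(positions.numel())
    # Memory heads span the whole video: linearly compress positions
    # into the pretrained range (purely illustrative).
    span = max(int(positions.max()) + 1, max_pos)
    return (positions.float() * (max_pos - 1) / (span - 1)).long()
```

For example, `reencode_positions("memory", torch.arange(0, 3600, 4))` squeezes a minute's worth of frame-token indices back into the trained window before `rope_angles` is applied, so no head ever sees a position it was not pretrained on.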