Head Forcing Extends Autoregressive Video Generation to Minute-Length Videos
Head Forcing is a training-free framework that lets autoregressive video diffusion models generate minute-long videos without accumulating errors or losing context. The method, detailed in an arXiv preprint (2605.14487), challenges the uniform treatment of attention heads in current AR video diffusion transformers. The researchers found that attention heads play distinct roles: local heads refine fine-grained detail, anchor heads maintain structural integrity, and memory heads aggregate long-range context. Head Forcing assigns each head type its own KV cache strategy: local and anchor heads retain only essential tokens, while memory heads draw on a hierarchical memory with dynamic episodic updates. A complementary head-wise RoPE re-encoding scheme keeps positional encodings within the pretrained range. Together, these changes extend generation from 5 seconds to minute-length without any additional training, substantially improving long-horizon video synthesis.
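The preprint's implementation details are not given in this summary, but the following minimal sketch illustrates what role-specific KV cache policies could look like. Everything in it is an illustrative assumption rather than the paper's actual algorithm: the function name, the sliding-window rule for local heads, the top-k importance criterion for anchor heads, and the mean-pooled episodic summarization standing in for the hierarchical memory.

```python
import torch

def update_kv_cache(head_type, cache_k, cache_v, new_k, new_v,
                    scores=None, window=256, top_k=64,
                    budget=4096, block=512):
    """Append new keys/values for one head, then prune by head role.

    cache_k, cache_v, new_k, new_v: (tokens, dim) tensors.
    scores: per-token importance (e.g. accumulated attention weight),
    used only by anchor heads in this sketch.
    """
    k = torch.cat([cache_k, new_k], dim=0)
    v = torch.cat([cache_v, new_v], dim=0)

    if head_type == "local":
        # Local heads refine detail: a recent sliding window suffices.
        return k[-window:], v[-window:]

    if head_type == "anchor":
        # Anchor heads maintain structure: keep the highest-scoring
        # tokens, re-sorted back into temporal order.
        idx = torch.topk(scores, min(top_k, scores.numel())).indices
        idx = idx.sort().values
        return k[idx], v[idx]

    # Memory heads aggregate long-range context. As a stand-in for the
    # paper's hierarchical memory with dynamic episodic updates, we
    # mean-pool the oldest block into one episodic entry whenever the
    # cache exceeds its budget.
    if k.shape[0] > budget:
        k = torch.cat([k[:block].mean(0, keepdim=True), k[block:]], dim=0)
        v = torch.cat([v[:block].mean(0, keepdim=True), v[block:]], dim=0)
    return k, v
```

The sketch takes the head classification itself as given; in practice it would presumably come from profiling attention patterns on the pretrained model.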
Key facts
- Head Forcing is a training-free framework for autoregressive video diffusion models.
- It addresses error accumulation and context loss in long-horizon video generation.
- Attention heads are categorized as local, anchor, and memory heads with distinct roles.
- Local and anchor heads retain only essential tokens in KV cache.
- Memory heads use a hierarchical memory system with dynamic episodic updates.
- A head-wise RoPE re-encoding scheme keeps positional encodings within the pretrained range (see the sketch after this list).
- Generation length extends from 5 seconds to minute-length without additional training.
- The method is described in arXiv preprint 2605.14487.
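The summary only states that head-wise RoPE re-encoding keeps positions inside the pretrained range; the per-role remapping rules below are illustrative assumptions chosen to match each head's job, not the paper's actual scheme.

```python
import torch

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE rotation angles for integer positions -> (tokens, dim/2)."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions[:, None].float() * inv_freq[None, :]

def reencode_positions(head_type, positions, max_pos=1024):
    """Remap raw token positions into [0, max_pos) depending on head role."""
    if head_type == "local":
        # Local heads see a recent window: shift to relative offsets.
        return positions - positions.min()
    if head_type == "anchor":
        # Anchor heads attend to sparse retained tokens: re-index them
        # densely so gaps left by eviction don't inflate positions.
        return torch.arange(positions.numel())
    # Memory heads span the whole video: linearly compress positions
    # into the pretrained range (purely illustrative).
    span = max(int(positions.max()) + 1, max_pos)
    return (positions.float() * (max_pos - 1) / (span - 1)).long()
```

For example, `reencode_positions("memory", torch.arange(0, 3600, 4))` squeezes a minute's worth of frame-token indices back into the trained window before `rope_angles` is applied, so no head ever sees a position it was not pretrained on.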