Entity-Centric Memory for Consistent Multi-Shot Video Generation

other · 2026-05-25

EM-Vid is an entity-centric memory framework for multi-shot video generation. It addresses the problem of keeping recurring entities visually consistent across shots while each shot still follows its own text prompt. Recent autoregressive approaches use complete frames as memory, which causes information leakage and raises computational cost. EM-Vid instead stores memory as an entity-indexed bank of latent patches and conditions generation on it through sparse tokens, a design that is training-free and compatible with pretrained models. Restricting self-attention to entity-relevant tokens reduces the computational cost. The approach also introduces a structured multi-shot script format, a budgeted memory update strategy that keeps the evolving memory compact, and a noise-injection mechanism for fine-grained appearance control. The paper is available on arXiv under ID 2605.23610.
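To make the idea of an entity-indexed bank with a bounded size concrete, here is a minimal sketch in Python. It is not the paper's implementation: the class name `EntityMemoryBank`, the per-entity budget, the first-in-first-out eviction rule, and the use of plain float lists in place of latent patches are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EntityMemoryBank:
    """Hypothetical entity-indexed store of latent patches with a per-entity budget."""

    budget_per_entity: int = 4
    # entity name -> list of (shot_index, patch); a patch is a plain list of floats here
    patches: dict = field(default_factory=lambda: defaultdict(list))

    def update(self, entity: str, shot_index: int, patch: list) -> None:
        """Add a patch, then drop the oldest entries beyond the budget (assumed FIFO policy)."""
        entries = self.patches[entity]
        entries.append((shot_index, patch))
        if len(entries) > self.budget_per_entity:
            del entries[: len(entries) - self.budget_per_entity]

    def retrieve(self, entities: list) -> list:
        """Return only the patches of entities named in the current shot."""
        return [patch for name in entities for _, patch in self.patches.get(name, [])]


# Usage: three shots featuring "alice", one featuring "dog", budget of two patches each.
bank = EntityMemoryBank(budget_per_entity=2)
for shot in range(3):
    bank.update("alice", shot, [float(shot)] * 4)
bank.update("dog", 0, [9.0] * 4)

print(len(bank.retrieve(["alice"])))  # 2: the patch from shot 0 was evicted
```

Because retrieval is keyed by entity rather than by frame, a shot that mentions only "alice" never sees the patches stored for "dog", which is the property the frame-based memories described above lack.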

Key facts

EM-Vid proposes entity-centric memory for multi-shot video generation.
Memory is stored as an entity-indexed bank of latent patches.
Sparse token conditioning reduces computational cost by restricting self-attention to entity-relevant tokens.
A structured multi-shot script format is introduced.
A budgeted memory update strategy maintains a compact, evolving memory.
A noise-injection mechanism enables fine-grained appearance control.
The method is training-free and compatible with pretrained models.
The paper is available on arXiv with ID 2605.23610.
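The sparse conditioning and noise-injection points in the list above can be illustrated with a small, dependency-free Python sketch. Everything here is an assumption for illustration rather than the paper's method: the function names, the linear blend used for noise injection, and the toy two-dimensional tokens are hypothetical, and the attention is ordinary scaled dot-product attention applied to a reduced key set.

```python
import math
import random


def select_entity_tokens(memory_tokens, token_entities, shot_entities):
    """Keep only memory tokens whose entity label appears in the current shot."""
    wanted = set(shot_entities)
    return [tok for tok, ent in zip(memory_tokens, token_entities) if ent in wanted]


def inject_noise(tokens, strength, rng):
    """Blend each token with Gaussian noise; strength in [0, 1] (assumed linear blend)."""
    return [
        [(1.0 - strength) * x + strength * rng.gauss(0.0, 1.0) for x in tok]
        for tok in tokens
    ]


def attend(queries, keys, values):
    """Scaled dot-product attention over the given (already sparsified) keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
        )
    return outputs


# Usage: three memory tokens, two of which belong to the entity in this shot.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = ["alice", "dog", "alice"]
kept = select_entity_tokens(memory, labels, ["alice"])
noisy = inject_noise(kept, strength=0.0, rng=random.Random(0))  # strength 0 leaves tokens unchanged

shot_tokens = [[0.5, 0.5]]
context = shot_tokens + noisy  # shot tokens attend to themselves plus the kept memory
out = attend(shot_tokens, context, context)
```

The cost saving comes from the shorter key list: attention work grows with the number of keys, so dropping tokens for entities absent from the shot shrinks it directly. Raising `strength` in this sketch would loosen how closely the memory tokens pin down appearance, which is one plausible reading of how noise injection could act as a control knob.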
