Entity-Centric Memory for Consistent Multi-Shot Video Generation
EM-Vid is an entity-centric memory framework for multi-shot video generation. It addresses the problem of keeping recurring entities visually consistent across shots while each shot still follows its own text prompt. Recent autoregressive methods use complete frames as memory, which causes information leakage and raises computational cost; EM-Vid instead stores memory as an entity-indexed bank of latent patches. It conditions generation on these patches through sparse token conditioning, which is compatible with pretrained models and restricts self-attention to entity-relevant tokens to reduce computation. The approach also includes a structured multi-shot script format, a budgeted memory update strategy that keeps the evolving memory compact, and a noise-injection mechanism for fine-grained appearance control. The paper is available on arXiv with ID 2605.23610.
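To make the idea of an entity-indexed bank with a bounded size concrete, here is a minimal Python sketch. It is not the paper's implementation: the class name `EntityMemoryBank`, the `budget_per_entity` parameter, the first-in-first-out eviction rule, and the toy list-of-floats "patches" are all assumptions chosen for illustration, since the summary does not specify how the budget is enforced.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EntityMemoryBank:
    """Hypothetical entity-indexed bank: entity name -> list of latent patches."""

    budget_per_entity: int = 4  # assumed cap; the source gives no number
    patches: dict = field(default_factory=lambda: defaultdict(list))

    def update(self, entity, patch):
        """Store a patch, evicting the oldest one once the budget is exceeded."""
        bank = self.patches[entity]
        bank.append(patch)
        if len(bank) > self.budget_per_entity:
            bank.pop(0)  # FIFO eviction is an assumption, not the paper's rule

    def retrieve(self, entities):
        """Return only the patches of the entities named in the current shot."""
        return [p for e in entities for p in self.patches.get(e, [])]


# Toy usage: three shots contribute a patch for "alice", one for "dog".
bank = EntityMemoryBank(budget_per_entity=2)
for shot_id in range(3):
    bank.update("alice", [float(shot_id)] * 4)  # toy 4-dim "latent patch"
bank.update("dog", [9.0] * 4)

# Only the two most recent "alice" patches remain; "dog" is not retrieved
# because the current shot does not mention it.
print(bank.retrieve(["alice"]))
```

Indexing by entity rather than by frame means a shot only pulls in patches for the entities its prompt names, which is the property the full-frame memory of earlier methods lacks.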
Key facts
- EM-Vid proposes entity-centric memory for multi-shot video generation.
- Memory is stored as an entity-indexed bank of latent patches.
- Sparse token conditioning reduces computational cost by restricting self-attention to entity-relevant tokens.
- A structured multi-shot script format is introduced.
- A budgeted memory update strategy maintains a compact, evolving memory.
- Noise-injection mechanism enables fine-grained appearance control.
- Method is training-free and compatible with pretrained models.
- Paper available on arXiv with ID 2605.23610.
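The efficiency and appearance-control points above can be illustrated with a small Python sketch. It only counts query-key pairs for dense self-attention over a joint sequence and does not run a real attention layer. The function names, the toy memory contents, the token counts, and the variance-preserving noise blend are all assumptions made for illustration; the summary does not describe how EM-Vid actually selects tokens or injects noise.

```python
import math
import random


def select_entity_tokens(memory, shot_entities):
    """Keep only memory tokens whose entity appears in the current shot."""
    return [tok for name in shot_entities for tok in memory.get(name, [])]


def attention_pairs(num_video_tokens, num_memory_tokens):
    """Query-key pairs for dense self-attention over the joint sequence."""
    n = num_video_tokens + num_memory_tokens
    return n * n


def inject_noise(token, strength, rng):
    """Blend a token with Gaussian noise; strength=0 leaves it unchanged.

    The variance-preserving blend is an assumption for illustration only.
    """
    keep = math.sqrt(1.0 - strength)
    mix = math.sqrt(strength)
    return [keep * v + mix * rng.gauss(0.0, 1.0) for v in token]


# Toy memory: entity name -> list of latent patch tokens (all values made up).
memory = {
    "alice": [[1.0, 0.0], [0.9, 0.1]],
    "dog": [[0.0, 1.0]],
    "car": [[0.5, 0.5], [0.4, 0.6], [0.6, 0.4]],
}
shot_entities = ["alice", "dog"]  # entities named in the current shot's prompt

sparse = select_entity_tokens(memory, shot_entities)
all_tokens = [tok for toks in memory.values() for tok in toks]

video_tokens = 8  # toy count of video latent tokens for the shot
print(len(sparse), attention_pairs(video_tokens, len(sparse)))          # 3 121
print(len(all_tokens), attention_pairs(video_tokens, len(all_tokens)))  # 6 196

# Higher strength loosens how closely the shot must match the stored look.
rng = random.Random(0)
noisy = [inject_noise(tok, strength=0.3, rng=rng) for tok in sparse]
```

In this toy setting, dropping the three "car" tokens that the shot does not need cuts the pair count from 196 to 121. Because the cost of dense attention grows with the square of the sequence length, the gap would widen as the memory accumulates more entities.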
Entities
Institutions
- arXiv