DySink: Dynamic Frame Sinks for Efficient Long Video Generation
DySink is a framework for autoregressive long video generation that replaces static early-frame sinks with dynamic, retrieval-based ones. Conventional methods keep a fixed set of early frames as long-range references. As the visual context shifts, those frames become outdated, which introduces bias and can cause sink collapse due to RoPE-induced phase re-alignment. DySink instead maintains a compact memory bank from which it adaptively selects visually relevant historical frames, and adds a sink anomaly gate that detects excessive inter-head attention consensus. The authors report that this improves both the quality and the efficiency of generation.
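To make the retrieval idea concrete, here is a minimal sketch of a bounded memory bank that returns the stored frames most similar to a query feature. It is an illustration under stated assumptions, not DySink's implementation: the class name `FrameMemoryBank`, the FIFO eviction policy, the use of cosine similarity, and the toy 2-D features are all hypothetical choices, since the summary does not specify how the bank is populated or scored.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


class FrameMemoryBank:
    """Hypothetical compact memory bank of (frame_index, feature) pairs."""

    def __init__(self, capacity):
        # Assumption: oldest entries are evicted first once capacity is reached.
        self.entries = deque(maxlen=capacity)

    def add(self, frame_index, feature):
        self.entries.append((frame_index, list(feature)))

    def retrieve(self, query, k):
        """Return indices of the k stored frames most similar to the query."""
        scored = [(cosine(query, feat), idx) for idx, feat in self.entries]
        # Highest similarity first; ties broken by the earlier frame index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [idx for _, idx in scored[:k]]


# Toy usage: frames 0-1 look alike, frames 2-3 look alike.
bank = FrameMemoryBank(capacity=4)
bank.add(0, [1.0, 0.0])
bank.add(1, [0.9, 0.1])
bank.add(2, [0.0, 1.0])
bank.add(3, [0.1, 0.9])

# The current context resembles frames 2 and 3, so those become the sinks.
sinks = bank.retrieve(query=[0.0, 1.0], k=2)
print(sinks)
```

In this toy setup the selected sinks follow the current visual context rather than staying pinned to frames 0 and 1, which is the contrast with static early-frame sinks that the summary describes.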
Key facts
- DySink is a retrieval-based framework for autoregressive long video generation.
- It replaces static early-frame sinks with dynamic frame sinks.
- Traditional methods use fixed early frames as long-range references, which become outdated as the visual context shifts.
- Static sinks can cause bias and sink collapse due to RoPE-induced phase re-alignment.
- DySink maintains a compact memory bank.
- It selects visually relevant historical frames adaptively.
- A sink anomaly gate detects excessive inter-head consensus.
- The authors report improved generation quality and efficiency.
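The sink anomaly gate in the list above can be pictured with a small sketch. The summary says only that the gate detects excessive inter-head consensus; the statistic used here (mean pairwise cosine similarity between per-head attention distributions), the function names, and the 0.95 threshold are assumptions made for illustration and may differ from what DySink actually computes.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def inter_head_consensus(head_attn):
    """Mean pairwise cosine similarity across heads.

    head_attn: one attention distribution over frames per head.
    Assumption: this is a stand-in for whatever consensus measure DySink uses.
    """
    pairs = list(combinations(head_attn, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def sink_anomaly_gate(head_attn, threshold=0.95):
    """Return True when heads agree more than the (hypothetical) threshold."""
    return inter_head_consensus(head_attn) > threshold


# Heads attending to different frames: low consensus, gate stays closed.
diverse = [[0.7, 0.2, 0.1], [0.1, 0.7, 0.2], [0.2, 0.1, 0.7]]

# Every head piling onto the first frame: high consensus, gate fires.
collapsed = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]

print(sink_anomaly_gate(diverse))
print(sink_anomaly_gate(collapsed))
```

A gate of this shape treats near-identical attention patterns across heads as a warning sign; what the model does once the gate fires is not described in the summary and is left out of the sketch.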
Entities
Institutions
- arXiv