ARTFEED — Contemporary Art Intelligence

MAGIC-Video: Training-Free Framework for Ultra-Long Video Reasoning

ai-technology · 2026-05-12

MAGIC-Video is a training-free framework for reasoning over ultra-long videos that span days to weeks, such as egocentric recordings, live streams, and surveillance footage. Existing multimodal LLMs, even with context windows of millions of tokens, can hold only tens of minutes of densely sampled video, so most of the footage must be discarded before inference ever begins. Memory-augmented and agentic methods improve scalability, but they suffer from fragmented retrieval across modalities and cannot produce coherent long-range narrative summaries.

MAGIC-Video instead builds a multimodal memory graph interleaved with a narrative chain. The graph links episodic, semantic, and visual nodes through six typed edges to support cross-modal retrieval, while the narrative chain records long-horizon entity biographies and recurring activities. At inference time, an agentic loop combines graph retrieval with narrative fact injection. Full details are in arXiv:2605.08271v1.
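The summary doesn't spell out the graph schema, so the following is only an illustrative sketch: a minimal Python rendering of what a multimodal memory graph with typed edges over episodic, semantic, and visual nodes, plus an interleaved narrative chain, might look like. All class, edge, and field names here are assumptions; the paper specifies six typed edges, but this digest does not enumerate them.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class NodeKind(Enum):
    EPISODIC = auto()   # a time-stamped event segment
    SEMANTIC = auto()   # an abstracted concept or activity label
    VISUAL = auto()     # a keyframe / visual-embedding reference

class EdgeType(Enum):
    # Hypothetical stand-ins: the paper names six typed edges,
    # but their actual labels are not given in this summary.
    TEMPORAL_NEXT = auto()
    DEPICTS = auto()
    ABSTRACTS = auto()
    SAME_ENTITY = auto()
    CO_OCCURS = auto()
    RECURS_AS = auto()

@dataclass
class Node:
    node_id: str
    kind: NodeKind
    payload: dict   # e.g. caption, timestamp, pointer to an embedding

@dataclass
class MemoryGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: dict = field(default_factory=dict)   # (src_id, EdgeType) -> [dst_id]

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, etype: EdgeType, dst: str) -> None:
        self.edges.setdefault((src, etype), []).append(dst)

    def neighbors(self, src: str, etype: EdgeType) -> list:
        """One cross-modal retrieval step: follow a single typed edge."""
        return self.edges.get((src, etype), [])

@dataclass
class NarrativeEntry:
    timestamp: float
    entity: str
    fact: str   # e.g. "entity A returns to the kitchen"

# The interleaved narrative chain: an ordered log of entity histories
# and recurring activities, maintained alongside the graph.
narrative_chain: list[NarrativeEntry] = []
```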

Key facts

  • MAGIC-Video is a training-free framework for ultra-long video reasoning.
  • It addresses videos spanning days to weeks, including egocentric, live stream, and surveillance footage.
  • Current multimodal LLMs with million-token contexts cover only tens of minutes of dense video.
  • The framework uses a multimodal memory graph with six typed edges.
  • It includes an interleaved narrative chain for long-horizon entity biographies and recurring events.
  • At inference, an agentic loop combines graph retrieval with narrative fact injection (sketched after this list).
  • The paper is available on arXiv with ID 2605.08271v1.
  • The approach is memory-augmented and agentic.
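To make the inference step concrete, here is a hedged sketch of what the agentic loop might look like, reusing the MemoryGraph, EdgeType, and NarrativeEntry types from the sketch above. The planner prompt, the parse_plan helper, and the stopping rule are hypothetical stand-ins, not the paper's actual procedure; llm is any prompt-to-text callable.

```python
def parse_plan(plan: str) -> tuple[str, str]:
    # Hypothetical plan format: "EXPAND <node_id> <EDGE_TYPE>".
    _, node_id, etype_name = plan.split()
    return node_id, etype_name

def agentic_answer(question: str,
                   graph: MemoryGraph,
                   chain: list,          # list[NarrativeEntry]
                   llm,                  # any callable: prompt str -> text
                   max_steps: int = 5) -> str:
    """Hypothetical agentic loop: alternate graph retrieval with
    narrative fact injection until the planner decides it is done."""
    context: list[str] = []
    for _ in range(max_steps):
        # 1. Graph retrieval: ask the model which typed edge to expand.
        plan = llm(f"Question: {question}\nContext: {context}\n"
                   "Reply 'EXPAND <node_id> <EDGE_TYPE>' or 'DONE'.")
        if "DONE" in plan:
            break
        node_id, etype_name = parse_plan(plan)
        for dst in graph.neighbors(node_id, EdgeType[etype_name]):
            context.append(str(graph.nodes[dst].payload))
        # 2. Narrative fact injection: splice in long-horizon facts about
        #    any entity the planner mentioned in this step.
        for entry in chain:
            if entry.entity in plan and entry.fact not in context:
                context.append(entry.fact)
    return llm(f"Question: {question}\nFacts: {context}\nAnswer:")
```

The point the sketch tries to capture is that each iteration widens local context via typed-edge hops through the graph, while the narrative chain supplies the long-range entity facts a graph walk alone would miss.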

Entities

Institutions

  • arXiv

Sources

  • arXiv:2605.08271v1 — MAGIC-Video: Training-Free Framework for Ultra-Long Video Reasoning