Visual Agentic Memory Framework for Long Video Understanding

ai-technology · 2026-05-20

Researchers have introduced Visual Agentic Memory (VAM), a training-free framework for long video understanding. It has three components: Online Indexing, which selectively retains evidence as the video streams in; Hierarchical Memory, which organizes that evidence along both temporal and spatial dimensions; and Agentic Retrieval, which searches for and verifies candidate evidence before producing an answer. On OVO-Bench, VAM reaches an RT+BT average of 68.41, ahead of the 67.46 scored by the same underlying MLLM, Gemini 3 Flash, used end to end. VAM also showed strength on long-horizon video in an evaluation on the MM-Lifelong train@month split, which covers 105.6 hours over 51 days.
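To make the three-stage flow concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not the authors' implementation: every name in it (Evidence, HierarchicalMemory, online_index, agentic_retrieve, the 60-second segment length) is hypothetical, text captions stand in for visual features, and simple predicates stand in for the MLLM's keep and verify decisions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Evidence:
    t: float        # timestamp in seconds (hypothetical field)
    region: str     # coarse spatial label (hypothetical field)
    caption: str    # text stand-in for what was observed


@dataclass
class HierarchicalMemory:
    # Assumed layout: fixed-length time segments, each holding evidence
    # items that carry their own spatial label.
    segment_seconds: float = 60.0
    segments: Dict[int, List[Evidence]] = field(default_factory=dict)

    def add(self, ev: Evidence) -> None:
        key = int(ev.t // self.segment_seconds)
        self.segments.setdefault(key, []).append(ev)


def online_index(stream: Iterable[Evidence], memory: HierarchicalMemory,
                 keep: Callable[[Evidence], bool]) -> None:
    """Selective retention: store only items that pass the keep predicate."""
    for ev in stream:
        if keep(ev):
            memory.add(ev)


def agentic_retrieve(memory: HierarchicalMemory, query_terms: List[str],
                     verify: Callable[[Evidence], bool]) -> List[Evidence]:
    """Search for candidates, then verify each one before returning it."""
    candidates = [
        ev
        for key in sorted(memory.segments)
        for ev in memory.segments[key]
        if any(term in ev.caption for term in query_terms)
    ]
    return [ev for ev in candidates if verify(ev)]


# Toy stream: three observations, one of them uninformative.
stream = [
    Evidence(5.0, "kitchen", "person opens fridge"),
    Evidence(6.0, "kitchen", "static frame"),
    Evidence(130.0, "desk", "person opens laptop"),
]
memory = HierarchicalMemory()
online_index(stream, memory, keep=lambda ev: "static" not in ev.caption)
hits = agentic_retrieve(memory, ["opens"],
                        verify=lambda ev: ev.region == "kitchen")
```

In this toy run the static frame is dropped at indexing time, the two remaining items land in different time segments, and the verification step narrows the two search candidates to the single kitchen observation. In a real system the keep and verify callables would presumably be backed by model calls rather than string checks.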

Key facts

VAM is a training-free framework for long video understanding.
It includes Online Indexing, Hierarchical Memory, and Agentic Retrieval.
On OVO-Bench, VAM achieves an RT+BT average of 68.41.
Baseline Gemini 3 Flash achieves 67.46 on OVO-Bench.
MM-Lifelong train@month split covers 105.6 hours over 51 days.
VAM outperforms end-to-end use of the same MLLM.
Hierarchical Memory uses Parallel Representation.
Agentic Retrieval verifies candidate evidence before answering.
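
For scale, the reported figures can be turned into two derived numbers with simple arithmetic. The short Python snippet below uses only the values stated above; the per-day figure assumes the footage is averaged evenly across the 51 days, which the source does not state.

```python
# Reported OVO-Bench RT+BT averages.
vam_score = 68.41
baseline_score = 67.46
margin = round(vam_score - baseline_score, 2)   # 0.95 points

# Reported size of the MM-Lifelong train@month split.
total_hours = 105.6
days = 51
# Assumption: a plain average over the 51 days.
hours_per_day = round(total_hours / days, 2)    # 2.07 hours per day
```

The margin over the end-to-end baseline is therefore just under one point, and the split averages roughly two hours of video per day.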
