Visual Agentic Memory Framework for Long Video Understanding
Researchers have introduced Visual Agentic Memory (VAM), a training-free framework for long video understanding. It has three components: Online Indexing, which selectively retains evidence as the video streams in; Hierarchical Memory, which organizes that evidence along both time and space; and Agentic Retrieval, which searches for candidate evidence and verifies it before producing an answer. On OVO-Bench, VAM reaches an RT+BT average of 68.41, ahead of the 67.46 scored by the same underlying MLLM, Gemini 3 Flash, used end to end. VAM also showed strength on long-horizon video in an evaluation on the MM-Lifelong train@month split, which totals 105.6 hours over 51 days.
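To make the division of labor among the three components concrete, here is a minimal Python sketch. It is not the authors' implementation: the names `Evidence`, `HierarchicalMemory`, `online_index`, and `agentic_retrieve`, the 60-second segment size, the keyword matching, and the `is_salient` and `verify` callbacks are all hypothetical stand-ins for whatever models and data structures VAM actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Evidence:
    t: float      # timestamp in seconds
    region: str   # coarse spatial label (hypothetical)
    text: str     # caption-like description of what was seen


@dataclass
class HierarchicalMemory:
    # Assumption: time is bucketed into fixed-length segments, and each
    # stored item keeps its own spatial label.
    segment_seconds: float = 60.0
    segments: Dict[int, List[Evidence]] = field(default_factory=dict)

    def add(self, ev: Evidence) -> None:
        key = int(ev.t // self.segment_seconds)
        self.segments.setdefault(key, []).append(ev)


def online_index(stream: Iterable[Evidence],
                 memory: HierarchicalMemory,
                 is_salient: Callable[[Evidence], bool]) -> None:
    """Keep only the evidence judged worth remembering while streaming."""
    for ev in stream:
        if is_salient(ev):
            memory.add(ev)


def agentic_retrieve(memory: HierarchicalMemory,
                     query: str,
                     verify: Callable[[Evidence, str], bool]) -> List[Evidence]:
    """Search for candidates, then keep only those that pass verification."""
    terms = set(query.lower().split())
    candidates = [
        ev
        for key in sorted(memory.segments)
        for ev in memory.segments[key]
        if terms & set(ev.text.lower().split())
    ]
    return [ev for ev in candidates if verify(ev, query)]


# Toy usage with hand-written "frames" in place of real video.
stream = [
    Evidence(5.0, "kitchen", "person opens fridge"),
    Evidence(6.0, "kitchen", "nothing changes"),
    Evidence(70.0, "desk", "person types on laptop"),
]
memory = HierarchicalMemory()
online_index(stream, memory, is_salient=lambda ev: "nothing" not in ev.text)
hits = agentic_retrieve(memory, "fridge",
                        verify=lambda ev, q: ev.region == "kitchen")
print([ev.text for ev in hits])  # ['person opens fridge']
```

In a real system the salience filter and the verifier would presumably be model calls rather than lambdas. The sketch only shows the control flow: filter while streaming, store in a time-and-space structure, then search and verify before answering.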
Key facts
- VAM is a training-free framework for long video understanding.
- It includes Online Indexing, Hierarchical Memory, and Agentic Retrieval.
- On OVO-Bench, VAM achieves an RT+BT average of 68.41.
- The end-to-end Gemini 3 Flash baseline scores 67.46 on OVO-Bench.
- The MM-Lifelong train@month split covers 105.6 hours over 51 days.
- VAM outperforms end-to-end use of the same MLLM.
- Hierarchical Memory uses Parallel Representation.
- Agentic Retrieval verifies candidate evidence before answering.
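For scale, the figures above can be turned into a margin and a daily average with a few lines of arithmetic. The variable names are illustrative, and the per-day figure is a plain average that assumes nothing about how the footage is actually distributed across the 51 days.

```python
# Reported OVO-Bench RT+BT averages.
vam_score = 68.41
baseline_score = 67.46

# Absolute margin of VAM over the end-to-end baseline, in points.
margin = round(vam_score - baseline_score, 2)

# MM-Lifelong train@month: total footage and span.
total_hours = 105.6
total_days = 51

# Mean hours of video per day (simple average only).
hours_per_day = round(total_hours / total_days, 2)

print(margin)         # 0.95
print(hours_per_day)  # 2.07
```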
Entities
Institutions
- arXiv