ARTFEED — Contemporary Art Intelligence

Visual Agentic Memory Framework for Long Video Understanding

ai-technology · 2026-05-20

Researchers have introduced Visual Agentic Memory (VAM), a training-free framework for long video understanding. It has three components: Online Indexing, which selectively retains evidence as the video streams; Hierarchical Memory, which organizes that evidence along both time and space; and Agentic Retrieval, which searches for and verifies candidate evidence before producing an answer. On OVO-Bench, VAM achieved an RT+BT average of 68.41, ahead of the 67.46 scored by the same underlying MLLM, Gemini 3 Flash, when used end to end. VAM also showed its strength on long-horizon video in an evaluation on the MM-Lifelong train@month split, which covers 105.6 hours over 51 days.
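To make the three components concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not the authors' implementation: every name in it (Evidence, HierarchicalMemory, online_index, agentic_retrieve) is hypothetical, the salience filter and verifier are placeholder callables standing in for model calls, and keyword overlap stands in for whatever search the real system uses.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """One retained observation from the stream (hypothetical structure)."""
    time_s: float   # timestamp in seconds
    region: str     # coarse spatial label, e.g. "kitchen"
    caption: str    # text description of what was seen


@dataclass
class HierarchicalMemory:
    """Two-level store: time bucket -> spatial region -> list of evidence."""
    bucket_s: float = 60.0
    buckets: dict = field(default_factory=dict)

    def add(self, ev: Evidence) -> None:
        key = int(ev.time_s // self.bucket_s)
        self.buckets.setdefault(key, {}).setdefault(ev.region, []).append(ev)

    def all_evidence(self):
        # Walk the hierarchy in time order, then by region name.
        for key in sorted(self.buckets):
            for region in sorted(self.buckets[key]):
                yield from self.buckets[key][region]


def online_index(stream, memory, is_salient):
    """Online indexing: store only observations the filter marks as salient."""
    for ev in stream:
        if is_salient(ev):
            memory.add(ev)


def agentic_retrieve(memory, query, verify, top_k=3):
    """Search by keyword overlap (an assumption), then verify each candidate."""
    terms = set(query.lower().split())
    scored = []
    for ev in memory.all_evidence():
        overlap = len(terms & set(ev.caption.lower().split()))
        if overlap > 0:
            scored.append((overlap, ev))
    scored.sort(key=lambda pair: -pair[0])  # highest overlap first, stable
    candidates = [ev for _, ev in scored[:top_k]]
    # Only candidates that pass verification are returned as evidence.
    return [ev for ev in candidates if verify(query, ev)]
```

In use, a caller would build a HierarchicalMemory, pass the incoming observations through online_index with a salience filter, and then call agentic_retrieve with a question and a verifier. The verified evidence, not the raw stream, is what an answering model would see.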

Key facts

  • VAM is a training-free framework for long video understanding.
  • It includes Online Indexing, Hierarchical Memory, and Agentic Retrieval.
  • On OVO-Bench, VAM achieves an RT+BT average of 68.41.
  • The end-to-end Gemini 3 Flash baseline scores 67.46 on OVO-Bench.
  • MM-Lifelong train@month split covers 105.6 hours over 51 days.
  • VAM outperforms end-to-end use of the same MLLM.
  • Hierarchical Memory uses Parallel Representation.
  • Agentic Retrieval verifies candidate evidence before answering.

Entities

Institutions

  • arXiv

Sources