New AI Architecture MM-Mem Uses Semantic Information Bottleneck for Long-Horizon Video Understanding
Researchers have developed MM-Mem, a pyramidal multimodal memory architecture aimed at the difficulties multimodal large language models face in long-horizon video comprehension. The system organizes memory into three hierarchical tiers: a Sensory Buffer, an Episodic Stream, and a Symbolic Schema. This structure supports progressive distillation, transforming fine-grained perceptual traces into high-level semantic schemas. Grounded in Fuzzy-Trace Theory, the architecture incorporates a Semantic Information Bottleneck that governs how memory is dynamically constructed.

While current multimodal models excel at short-term reasoning, they falter in long-horizon video analysis because of limited context windows and inflexible memory systems. Existing strategies tend toward one of two extremes: vision-centric approaches that retain too much visual data, incurring latency and redundancy, or text-centric approaches that discard detail and invite hallucinations. MM-Mem aims to reconcile these trade-offs. The research was published on arXiv under the identifier arXiv:2603.01455v3, listed in the replace-cross category.
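The article does not describe MM-Mem's implementation, so the following is only a minimal Python sketch of how a three-tier pyramidal memory with capacity-triggered distillation might be organized. All names (PyramidalMemory, observe, the capacity thresholds, the salience cutoff) are illustrative assumptions, not the paper's published interface.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a three-tier pyramidal memory.
# Class names, capacities, and the distillation hooks below are
# illustrative assumptions, not MM-Mem's published design.

@dataclass
class MemoryItem:
    timestamp: float
    content: str          # e.g., a caption, embedding ID, or schema label
    salience: float = 0.0

@dataclass
class PyramidalMemory:
    sensory_capacity: int = 64      # fine-grained perceptual traces
    episodic_capacity: int = 256    # consolidated event-level records
    sensory_buffer: list = field(default_factory=list)
    episodic_stream: list = field(default_factory=list)
    symbolic_schema: list = field(default_factory=list)

    def observe(self, item: MemoryItem) -> None:
        """Add a raw perceptual trace; consolidate when the buffer fills."""
        self.sensory_buffer.append(item)
        if len(self.sensory_buffer) > self.sensory_capacity:
            self._consolidate()

    def _consolidate(self) -> None:
        """Promote salient traces to the episodic stream, drop the rest."""
        salient = [it for it in self.sensory_buffer if it.salience > 0.5]
        self.episodic_stream.extend(salient)
        self.sensory_buffer.clear()
        if len(self.episodic_stream) > self.episodic_capacity:
            self._abstract()

    def _abstract(self) -> None:
        """Distill episodic records into a compact symbolic summary.

        A real system would use an LLM or clustering here; a placeholder
        summary item stands in to show the control flow."""
        summary = MemoryItem(
            timestamp=self.episodic_stream[-1].timestamp,
            content=f"schema over {len(self.episodic_stream)} events",
        )
        self.symbolic_schema.append(summary)
        self.episodic_stream.clear()
```

The point of the sketch is the direction of flow: information only moves upward, from many cheap perceptual traces to few compact schemas, which is what lets the resident memory stay bounded over arbitrarily long videos.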
Key facts
- MM-Mem is a pyramidal multimodal memory architecture for long-horizon video understanding
- The architecture structures memory hierarchically into Sensory Buffer, Episodic Stream, and Symbolic Schema
- It enables progressive distillation from fine-grained perceptual traces to high-level semantic schemas
- The system is grounded in Fuzzy-Trace Theory
- A Semantic Information Bottleneck governs dynamic memory construction (a reference formulation is sketched after this list)
- Multimodal large language models struggle with long-horizon video understanding due to limited context windows
- Existing methods fall into vision-centric (high latency) or text-centric (detail loss) extremes
- Research was published on arXiv under identifier arXiv:2603.01455v3
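The summary does not give MM-Mem's formal objective. As a reference point only, a "Semantic Information Bottleneck" presumably adapts the classical information bottleneck of Tishby et al., which learns a compressed representation Z of an input X that stays informative about a task variable Y:

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

Read in MM-Mem's terms (an assumption, not a stated result): X would be the raw perceptual stream, Z the memory content retained across the hierarchy, Y the downstream video-understanding target, and \beta the knob trading compression against task relevance.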