Persistent Visual Memory Module Enhances LVLM Visual Perception
A research paper introduces Persistent Visual Memory (PVM), a lightweight learnable module that addresses the 'Visual Signal Dilution' problem in autoregressive Large Vision-Language Models (LVLMs), in which attention to visual tokens decays as the generated text sequence lengthens. PVM is integrated as a parallel branch alongside the Feed-Forward Network (FFN), creating a distance-agnostic retrieval pathway that supplies visual embeddings directly to the decoder, so the visual signal is not suppressed late in generation. Experiments on Qwen3-VL models show consistent accuracy gains with negligible parameter overhead. The paper is available on arXiv under identifier 2605.00814.
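The structural idea can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes the PVM branch is a small cross-attention over a cache of visual embeddings, with no positional terms, whose output is summed with the FFN output. All function and parameter names here (`pvm_block`, `w_q`, `w_k`, etc.) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ffn(h, w1, w2):
    # Standard transformer feed-forward network (ReLU used for brevity).
    return np.maximum(h @ w1, 0.0) @ w2

def pvm_block(h, visual, w1, w2, w_q, w_k):
    """One decoder sub-layer with a hypothetical PVM branch parallel to the FFN.

    h:      (seq_len, d)  current hidden states
    visual: (n_vis, d)    cached visual embeddings
    """
    d = h.shape[-1]
    # Distance-agnostic retrieval: the attention over the visual memory uses
    # no positional terms, so a token generated late in the sequence can
    # retrieve visual information as strongly as an early one.
    scores = (h @ w_q) @ (visual @ w_k).T / np.sqrt(d)
    retrieved = softmax(scores, axis=-1) @ visual
    # The PVM output is added alongside the FFN output (parallel branch),
    # with a residual connection on the hidden states.
    return h + ffn(h, w1, w2) + retrieved
```

Because the retrieval path bypasses the decaying self-attention over the prompt, the visual contribution to each token does not shrink as the generated sequence grows, which is the mechanism the paper attributes its gains to.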
Key facts
- PVM is a lightweight learnable module for LVLMs.
- It addresses 'Visual Signal Dilution' where visual attention decays with generated sequence length.
- PVM is integrated as a parallel branch alongside the Feed-Forward Network (FFN).
- It establishes a distance-agnostic retrieval pathway for direct visual embeddings.
- Experiments were conducted on Qwen3-VL models.
- PVM yields consistent accuracy gains with negligible parameter overhead.
- The paper is published on arXiv with ID 2605.00814.
- The retrieval pathway gives the model sustained, on-demand access to visual information throughout generation.
Entities
Institutions
- arXiv