Persistent Visual Memory Module Enhances LVLM Visual Perception
A research paper introduces Persistent Visual Memory (PVM), a lightweight learnable module that addresses the 'Visual Signal Dilution' problem in autoregressive Large Vision-Language Models (LVLMs), in which attention to visual tokens decays as the generated text sequence lengthens. PVM is integrated as a parallel branch alongside the Feed-Forward Network (FFN), creating a distance-agnostic retrieval pathway that supplies visual embeddings directly to the decoder, so the visual signal is not suppressed late in generation. Experiments on Qwen3-VL models show consistent accuracy gains with negligible parameter overhead. The paper is available on arXiv under identifier 2605.00814.
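The structural idea can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes the PVM branch is a small cross-attention over a cache of visual embeddings, with no positional terms, whose output is summed with the FFN output. All function and parameter names here (`pvm_block`, `w_q`, `w_k`, etc.) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ffn(h, w1, w2):
    # Standard transformer feed-forward network (ReLU used for brevity).
    return np.maximum(h @ w1, 0.0) @ w2

def pvm_block(h, visual, w1, w2, w_q, w_k):
    """One decoder sub-layer with a hypothetical PVM branch parallel to the FFN.

    h:      (seq_len, d)  current hidden states
    visual: (n_vis, d)    cached visual embeddings
    """
    d = h.shape[-1]
    # Distance-agnostic retrieval: the attention over the visual memory uses
    # no positional terms, so a token generated late in the sequence can
    # retrieve visual information as strongly as an early one.
    scores = (h @ w_q) @ (visual @ w_k).T / np.sqrt(d)
    retrieved = softmax(scores, axis=-1) @ visual
    # The PVM output is added alongside the FFN output (parallel branch),
    # with a residual connection on the hidden states.
    return h + ffn(h, w1, w2) + retrieved
```

Because the retrieval path bypasses the decaying self-attention over the prompt, the visual contribution to each token does not shrink as the generated sequence grows, which is the mechanism the paper attributes its gains to.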
Key facts
- PVM is a lightweight learnable module for LVLMs.
- It addresses 'Visual Signal Dilution' where visual attention decays with generated sequence length.
- PVM is integrated as a parallel branch alongside the Feed-Forward Network (FFN).
- It establishes a distance-agnostic retrieval pathway for direct visual embeddings.
- Experiments were conducted on Qwen3-VL models.
- PVM yields consistent accuracy gains with negligible parameter overhead.
- The paper is published on arXiv with ID 2605.00814.
- The retrieval pathway gives the model sustained, on-demand access to visual information throughout generation.
Entities
Institutions
- arXiv