Attention Distraction Linked to Hallucinations in MLLMs; New Algorithm Aims to Correct It
A recent arXiv preprint links object hallucinations in multimodal large language models (MLLMs) to attention distraction, a phenomenon similar to one observed in humans. The researchers report that humans with divided attention perceive scenes less clearly and describe them less accurately, while MLLMs show two analogous symptoms: spatial attention that is inconsistent across attention heads, and attention to image tokens that fades as decoding proceeds. Their theoretical analysis suggests that this attention dispersion increases model complexity and degrades classification generalization. To mitigate the problem, they introduce the Attention-Focused Approach for Improved Image Perception (AFIP), which sharpens attention through cross-head attention enrichment and strengthens visual grounding through dynamic historical attention enhancement.
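The two symptoms described above can be made concrete with a small diagnostic. The following is a minimal sketch, not the paper's code. It assumes attention weights are available as a NumPy array of shape (steps, heads, tokens) whose rows sum to 1, and that a boolean NumPy mask marks the image tokens. The function names and the cosine-based disagreement measure are illustrative assumptions.

```python
import numpy as np

def image_attention_mass(attn, image_mask):
    """Share of attention placed on image tokens at each decoding step.

    attn: array of shape (steps, heads, tokens), each row summing to 1.
    image_mask: boolean array of shape (tokens,), True for image tokens.
    Returns an array of shape (steps,), averaged over heads.
    """
    return attn[:, :, image_mask].sum(axis=-1).mean(axis=-1)

def head_disagreement(attn, image_mask):
    """Mean pairwise (1 - cosine similarity) between heads, per step.

    Only the image-token part of each head's attention is compared.
    Requires at least two heads. Returns an array of shape (steps,).
    """
    img = attn[:, :, image_mask]                       # (steps, heads, n_img)
    norm = np.linalg.norm(img, axis=-1, keepdims=True)
    unit = img / np.clip(norm, 1e-12, None)            # avoid division by zero
    sim = unit @ unit.transpose(0, 2, 1)               # (steps, heads, heads)
    h = sim.shape[-1]
    # Average the off-diagonal entries, i.e. similarities between distinct heads.
    off_diag = (sim.sum(axis=(1, 2)) - np.trace(sim, axis1=1, axis2=2)) / (h * (h - 1))
    return 1.0 - off_diag
```

Under these assumptions, a falling `image_attention_mass` over steps would correspond to the temporal fading described above, and a high `head_disagreement` to the spatial inconsistency across heads.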
Key facts
- Paper published on arXiv with ID 2605.24602
- Reveals link between object hallucinations in MLLMs and attention distraction
- Attention distraction appears in MLLMs as spatially inconsistent attention across heads
- Temporal fading of attention to image tokens occurs during decoding
- Attention dispersion increases model complexity and degrades classification generalization
- Proposes AFIP algorithm to correct attention distraction
- AFIP uses cross-head attention enrichment and dynamic historical attention enhancement
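This summary does not give AFIP's equations, so the following is only a hypothetical sketch of what the two named ingredients could look like for a single decoding step. It is not the authors' algorithm. The function `refocus_attention`, the mixing weights `alpha` and `beta`, the `decay` factor, and the use of a running average as the "history" are all assumptions made for illustration.

```python
import numpy as np

def refocus_attention(attn, image_mask, history=None, alpha=0.5, beta=0.5, decay=0.9):
    """Illustrative re-weighting of one decoding step's attention.

    attn: array of shape (heads, tokens), each row summing to 1.
    image_mask: boolean NumPy array of shape (tokens,), True for image tokens.
    history: running average of past head-averaged attention, shape (tokens,),
             or None on the first step.
    Returns (new_attn, new_history).
    """
    attn = np.asarray(attn, dtype=float)
    head_mean = attn.mean(axis=0, keepdims=True)        # (1, tokens)

    # Assumed "cross-head enrichment": pull each head's image-token weights
    # toward the consensus of all heads.
    mixed = attn.copy()
    mixed[:, image_mask] = (
        (1 - alpha) * attn[:, image_mask] + alpha * head_mean[:, image_mask]
    )

    # Assumed "historical enhancement": add back a share of the image
    # attention observed at earlier decoding steps.
    if history is not None:
        mixed[:, image_mask] += beta * history[image_mask]

    # Renormalize so each head is again a probability distribution.
    mixed /= mixed.sum(axis=-1, keepdims=True)

    # Update the running average with the unmodified head-averaged attention.
    if history is None:
        new_history = head_mean[0]
    else:
        new_history = decay * history + (1 - decay) * head_mean[0]
    return mixed, new_history
```

In this toy formulation, a step whose image attention has faded receives extra weight on the image tokens that earlier steps attended to, which is one plausible reading of "dynamic historical attention enhancement". The actual method may differ substantially.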
Entities
Institutions
- arXiv