New Method ILVAD Reduces Hallucination in Large Vision-Language Models
Researchers report that Large Vision-Language Models (LVLMs) hallucinate because they pay too little attention to the correct visual evidence and gradually forget it as generation proceeds. They also observe an inter-layer visual attention discrepancy, in which certain layers are sensitive to the correct evidence. Building on this observation, they propose ILVAD (Inter-Layer Visual Attention Discrepancy), a method that enhances visual evidence by identifying tokens that are repeatedly activated across layers, using the attention weights from early generated tokens to visual tokens. The paper is available on arXiv under ID 2605.20965.
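The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how "repeatedly activated" is defined or how the evidence is then enhanced. The sketch assumes that the identified tokens are visual tokens, that a token counts as activated in a layer when it is among that layer's top-k by attention, and that attention is averaged over heads and over the first few generated tokens. The function name and the parameters `num_early`, `top_k`, and `min_layers` are hypothetical.

```python
import numpy as np


def repeatedly_activated_tokens(attn, num_early=4, top_k=3, min_layers=2):
    """Find visual tokens that rank highly in several layers.

    attn: array of shape (layers, heads, generated_tokens, visual_tokens)
          holding attention weights from generated tokens to visual tokens.
    Returns (selected_indices, per_token_layer_counts).
    All thresholds are illustrative assumptions, not values from the paper.
    """
    attn = np.asarray(attn, dtype=float)
    # Average over heads and over the first `num_early` generated tokens.
    per_layer = attn[:, :, :num_early, :].mean(axis=(1, 2))  # (layers, visual)
    # Indices of the top-k visual tokens in each layer.
    top_idx = np.argsort(-per_layer, axis=1, kind="stable")[:, :top_k]
    # Count in how many layers each visual token reaches the top-k.
    counts = np.zeros(per_layer.shape[1], dtype=int)
    for layer_top in top_idx:
        counts[layer_top] += 1
    # Keep the tokens activated in at least `min_layers` layers.
    selected = np.flatnonzero(counts >= min_layers)
    return selected, counts


# Toy example: 3 layers, 1 head, 2 generated tokens, 5 visual tokens.
rows = np.array([
    [0.05, 0.50, 0.30, 0.10, 0.05],
    [0.05, 0.45, 0.10, 0.35, 0.05],
    [0.10, 0.40, 0.35, 0.10, 0.05],
])
toy_attn = np.repeat(rows[:, None, None, :], 2, axis=2)
selected, counts = repeatedly_activated_tokens(toy_attn, top_k=2, min_layers=2)
# counts   -> [0, 3, 2, 1, 0]
# selected -> [1, 2]
```

In the toy data, token 1 is in the top two of all three layers and token 2 in two of them, so both are selected, while token 3 appears in only one layer and is dropped. How ILVAD then strengthens the selected evidence is not covered by this summary, so the sketch stops at selection.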
Key facts
- LVLMs hallucinate when they pay insufficient attention to correct visual evidence.
- LVLMs gradually forget visual evidence during generation.
- Specific layers are sensitive to correct visual evidence, which produces an inter-layer visual attention discrepancy.
- ILVAD enhances visual evidence based on inter-layer visual attention discrepancy.
- ILVAD uses attention weights from early generated tokens to visual tokens.
- The method identifies tokens that are repeatedly activated across layers.
- The method aims to mitigate hallucination in LVLMs.
- The paper is available on arXiv (ID 2605.20965).
Entities
Institutions
- arXiv