Cross-Modal Attention Calibration Reduces LVLM Hallucinations
Cross-Modal Attention Calibration (CMAC) is a training-free method for reducing hallucinations in large vision-language models (LVLMs), that is, generated text that is inconsistent with the visual input. Existing inference-time strategies such as contrastive decoding address over-reliance on language priors, but they overlook position bias and spurious inter-modality correlations. CMAC includes an Inter-Modality Decoding (IMD) module that identifies value vectors associated with large cross-modal attention weights, treats them as distortions, and masks them as part of a novel contrastive decoding mechanism. The method is described in a paper on arXiv (2501.01926v3) and targets complex generation tasks, where LVLMs are prone to hallucination.
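The masking idea can be illustrated with a toy example. The sketch below is not the paper's implementation. It assumes a single attention head, a single text query, and a simple top-k rule for deciding which cross-modal attention weights count as "large". The function names, the value of k, and the numbers are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_cross_modal(weights, image_positions, k):
    """Return the k image-token positions with the largest attention weight.

    Assumption: "large" is decided by top-k; the source does not state the rule.
    """
    ranked = sorted(image_positions, key=lambda j: weights[j], reverse=True)
    return frozenset(ranked[:k])

def attend(weights, values, masked=frozenset()):
    """Attention output: weighted sum of value vectors, with masked values zeroed."""
    out = [0.0] * len(values[0])
    for j, (w, v) in enumerate(zip(weights, values)):
        if j in masked:
            continue  # a zeroed value vector contributes nothing
        for d, component in enumerate(v):
            out[d] += w * component
    return out

# Toy data (made up): one text query attending to 3 image tokens and 1 text token.
scores = [2.0, 0.5, 0.1, 1.0]
image_positions = [0, 1, 2]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

weights = softmax(scores)
masked = top_cross_modal(weights, image_positions, k=1)
clean_output = attend(weights, values)               # ordinary attention output
distorted_output = attend(weights, values, masked)   # output with masked values
```

In this sketch the attention weights are left untouched and only the selected value vectors are dropped, which matches the description of masking value vectors rather than attention scores. Whether the real method renormalizes anything afterwards is not stated in this summary.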
Key facts
- CMAC is a training-free method to mitigate LVLM hallucinations.
- It addresses position bias and spurious inter-modality correlations.
- The Inter-Modality Decoding module masks value vectors with high cross-modal attention.
- The paper is available on arXiv with ID 2501.01926v3.
- LVLMs suffer from hallucinations in complex generation tasks.
- Existing contrastive decoding methods address over-reliance on language priors but overlook position bias and spurious inter-modality correlations.
- CMAC uses a novel contrastive decoding mechanism.
- The method does not require additional training.
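To make the contrastive decoding step concrete, here is a minimal sketch. It uses the combination rule common in visual contrastive decoding, (1 + alpha) * clean - alpha * distorted, applied to log-probabilities. This summary does not give CMAC's exact formula, so the rule, the alpha parameter, the vocabulary, and the logits are assumptions for illustration only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def contrastive_scores(clean_logits, distorted_logits, alpha=1.0):
    """Boost tokens the clean pass prefers relative to the distorted pass.

    Assumption: the generic contrastive decoding form, not a formula
    confirmed by the source.
    """
    clean = log_softmax(clean_logits)
    distorted = log_softmax(distorted_logits)
    return [(1 + alpha) * c - alpha * d for c, d in zip(clean, distorted)]

def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=lambda i: xs[i])

# Toy next-token logits (made up). The distorted pass strongly favors
# "frisbee", standing in for a token driven by a spurious correlation.
vocab = ["cat", "dog", "frisbee"]
clean_logits = [1.9, 1.0, 2.0]
distorted_logits = [0.5, 1.0, 2.5]

greedy_token = vocab[argmax(clean_logits)]
scores = contrastive_scores(clean_logits, distorted_logits, alpha=1.0)
contrastive_token = vocab[argmax(scores)]
```

The design intent is that a token scoring highly in both passes is penalized, since the distorted pass is meant to expose what the model would say for the wrong reasons. Larger alpha values strengthen that penalty.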
Entities
Institutions
- arXiv