Cross-Modal Attention Calibration Reduces LVLM Hallucinations

ai-technology · 2026-06-01

Cross-Modal Attention Calibration (CMAC) is a training-free method for reducing hallucinations in large vision-language models (LVLMs), where the generated text does not match the visual input. Existing inference-time strategies such as contrastive decoding address over-reliance on language priors, but they overlook position bias and spurious inter-modality correlations. CMAC includes an Inter-Modality Decoding (IMD) module that identifies value vectors associated with large cross-modal attention weights, masks them as distortions, and applies a new contrastive decoding mechanism. The method is described in a paper on arXiv (2501.01926v3) and targets complex generation tasks where LVLMs struggle.

Key facts

CMAC is a training-free method to mitigate LVLM hallucinations.
It addresses position bias and spurious inter-modality correlations.
The Inter-Modality Decoding module masks value vectors with high cross-modal attention.
The paper is available on arXiv with ID 2501.01926v3.
LVLMs suffer from hallucinations in complex generation tasks.
Existing contrastive decoding methods overlook certain hallucination sources.
CMAC uses a novel contrastive decoding mechanism.
The method does not require additional training.
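
The masking and contrastive steps listed above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's implementation: it uses a single attention head in NumPy, picks the top-k image tokens by attention weight as the "high cross-modal attention" set, and combines logits with the contrastive formula common in earlier contrastive decoding work, (1 + alpha) * original - alpha * distorted. The function names, the top-k selection rule, the alpha parameter, and the random projection W_out are all hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_output(q, K, V, value_mask=None):
    """Single-head attention for one text query.
    value_mask (bool, one entry per token) zeroes the selected value vectors."""
    weights = softmax(q @ K.T / np.sqrt(q.shape[-1]))
    if value_mask is not None:
        V = V * (~value_mask)[:, None]
    return weights @ V, weights

def top_cross_modal_mask(weights, is_image, k):
    """Mark the k image-token positions that receive the largest attention weight.
    Assumes 1 <= k <= number of image tokens (illustrative selection rule)."""
    cross = np.where(is_image, weights, -np.inf)  # ignore text positions
    idx = np.argsort(cross)[-k:]
    mask = np.zeros_like(is_image, dtype=bool)
    mask[idx] = True
    return mask

def contrastive_logits(logits_orig, logits_distorted, alpha=1.0):
    """Generic contrastive combination: amplify what the distorted pass lacks."""
    return (1 + alpha) * logits_orig - alpha * logits_distorted

# Toy setup: 4 image tokens followed by 2 text tokens, hidden size 8, vocab size 5.
rng = np.random.default_rng(0)
d, vocab = 8, 5
is_image = np.array([True, True, True, True, False, False])
K = rng.normal(size=(6, d))
V = rng.normal(size=(6, d))
q = rng.normal(size=d)
W_out = rng.normal(size=(d, vocab))  # hypothetical output projection

# Original pass, then a distorted pass with the selected value vectors masked.
out_orig, weights = attention_output(q, K, V)
mask = top_cross_modal_mask(weights, is_image, k=2)
out_dist, _ = attention_output(q, K, V, value_mask=mask)

logits_orig = out_orig @ W_out
logits_dist = out_dist @ W_out
calibrated = contrastive_logits(logits_orig, logits_dist, alpha=1.0)
next_token = int(np.argmax(calibrated))
```

In this sketch no weights are updated: both passes reuse the same K, V, and W_out, which matches the training-free framing of the method. How the actual IMD module selects value vectors and weights the two passes is defined in the paper and may differ from the choices made here.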
