ARTFEED — Contemporary Art Intelligence

Cross-Modal Attention Calibration Reduces LVLM Hallucinations

ai-technology · 2026-06-01

Cross-Modal Attention Calibration (CMAC) is a training-free method for reducing hallucinations in large vision-language models (LVLMs), that is, generated text that does not match the visual input. Existing inference-time strategies such as contrastive decoding address over-reliance on language priors but overlook position bias and spurious inter-modality correlations. CMAC's Inter-Modality Decoding (IMD) module treats value vectors associated with high cross-modal attention weights as distortions, masks them, and applies a new contrastive decoding mechanism. The method is described in a paper on arXiv (2501.01926v3) and targets complex generation tasks, where LVLMs struggle.
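To make the masking step concrete, here is a minimal sketch of zeroing the value vectors that receive the highest cross-modal attention for a single query. It is an illustration under stated assumptions, not the paper's implementation: the function names (`mask_high_attention_values`, `attend`), the top-k selection rule, and the toy numbers are all hypothetical, and the paper may select "substantial" weights differently (for example, with a threshold).

```python
def mask_high_attention_values(attn_weights, values, top_k):
    """Zero out the value vectors at the top_k highest-attention positions.

    attn_weights: one attention weight per image token (a single query row).
    values:       one value vector (list of floats) per image token.
    Top-k selection is an assumption made for this sketch.
    """
    if len(attn_weights) != len(values):
        raise ValueError("attn_weights and values must have the same length")
    k = max(0, min(top_k, len(values)))
    # Rank positions by attention weight, largest first.
    ranked = sorted(range(len(attn_weights)),
                    key=lambda i: attn_weights[i], reverse=True)
    masked_idx = set(ranked[:k])
    return [
        [0.0] * len(v) if i in masked_idx else list(v)
        for i, v in enumerate(values)
    ]


def attend(attn_weights, values):
    """Attention output for one query: the weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(attn_weights, values))
            for d in range(dim)]


# Toy example: three image tokens, two-dimensional value vectors.
weights = [0.1, 0.6, 0.3]
vals = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]

distorted_vals = mask_high_attention_values(weights, vals, top_k=1)
original_out = attend(weights, vals)
distorted_out = attend(weights, distorted_vals)
```

In this sketch the token with weight 0.6 is masked, so `distorted_out` drops that token's contribution while `original_out` keeps it. The two outputs correspond to the original and distorted passes that a contrastive decoding step would then compare.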

Key facts

  • CMAC is a training-free method to mitigate LVLM hallucinations.
  • It addresses position bias and spurious inter-modality correlations.
  • The Inter-Modality Decoding module masks value vectors with high cross-modal attention.
  • The paper is available on arXiv with ID 2501.01926v3.
  • LVLMs suffer from hallucinations in complex generation tasks.
  • Existing contrastive decoding methods overlook certain hallucination sources.
  • CMAC uses a novel contrastive decoding mechanism.
  • The method does not require additional training.
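For readers unfamiliar with contrastive decoding, the sketch below shows the generic form used in earlier inference-time methods: next-token logits from the original pass are contrasted against logits from a distorted pass. This is not confirmed to be CMAC's exact mechanism, which the paper describes as novel. The function names, the `alpha` weight, and the toy logits are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original, distorted, alpha=1.0):
    """Generic contrastive decoding: (1 + alpha) * original - alpha * distorted.

    Tokens that stay likely under the distorted pass are penalised; tokens
    that depend on the undistorted input are boosted.
    alpha is a hypothetical contrast strength chosen for this sketch.
    """
    if len(original) != len(distorted):
        raise ValueError("logit lists must have the same length")
    return [(1.0 + alpha) * o - alpha * d for o, d in zip(original, distorted)]


# Toy vocabulary of three tokens.
original = [2.0, 1.9, 0.1]   # logits from the unmodified forward pass
distorted = [2.0, 0.5, 0.1]  # logits from the distorted forward pass

adjusted = contrastive_logits(original, distorted, alpha=1.0)
probs = softmax(adjusted)
next_token = max(range(len(probs)), key=probs.__getitem__)
```

Here token 0 scores highest in the original pass but is equally likely under distortion, so the contrast demotes it and token 1 is selected instead.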

Entities

Institutions

  • arXiv