ARTFEED — Contemporary Art Intelligence

MHSA Framework Reduces Hallucinations in Vision-Language Models

ai-technology · 2026-05-16

A new framework called MHSA (Mitigating Hallucinations via Steered Attention) has been developed to reduce hallucinations in large vision-language models (LVLMs). Hallucinations occur when these models produce outputs that are not grounded in the visual input. Unlike earlier work such as DHCP (Detecting Hallucinations by Cross-modal Attention Pattern), which only detected hallucinations, MHSA aims to correct the underlying cross-modal attention patterns. It trains a lightweight three-layer MLP generator to produce corrected attention, guided by supervisory signals from both the DHCP discriminator and the LVLM itself. At inference time, MHSA substitutes this corrected attention for the model's original cross-modal attention, reducing both generative and discriminative hallucinations across multiple datasets and LVLMs. The paper is available on arXiv under the identifier 2605.14966.
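
To make the mechanism concrete, the following is a minimal sketch (in PyTorch) of what such an attention-correcting generator could look like. The class name, layer widths, input shape, and softmax renormalization are illustrative assumptions, not the architecture reported in the paper.

    # Hypothetical sketch of an attention-correcting generator in the spirit of MHSA.
    # Dimensions, activations, and renormalization are assumptions for illustration.
    import torch
    import torch.nn as nn

    class AttentionCorrector(nn.Module):
        """Three-layer MLP mapping a cross-modal attention pattern to a corrected one."""

        def __init__(self, num_image_tokens: int = 576, hidden_dim: int = 1024):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(num_image_tokens, hidden_dim),
                nn.GELU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.GELU(),
                nn.Linear(hidden_dim, num_image_tokens),
            )

        def forward(self, attn: torch.Tensor) -> torch.Tensor:
            # attn: (..., num_image_tokens), one text token's attention over image tokens.
            logits = self.net(attn)
            # Keep the corrected pattern a valid distribution over the image tokens.
            return torch.softmax(logits, dim=-1)

In a setup like the one described, such a generator would presumably be trained so that a DHCP-style discriminator classifies the corrected pattern as non-hallucinatory, together with a supervisory signal from the LVLM itself; the exact losses are not detailed in this summary.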

Key facts

  • MHSA stands for Mitigating Hallucinations via Steered Attention.
  • It is a lightweight framework for reducing hallucinations in LVLMs.
  • Previous work, DHCP, only detected hallucinations; it did not mitigate them.
  • MHSA trains a three-layer MLP generator to correct attention patterns.
  • Supervisory signals come from the DHCP discriminator and the LVLM.
  • During inference, MHSA replaces the original cross-modal attention (see the sketch after this list).
  • It addresses both discriminative and generative hallucinations.
  • The paper is on arXiv with ID 2605.14966.
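
As a rough illustration of the inference-time substitution, the sketch below computes scaled dot-product attention and overwrites the text-to-image columns of the attention map with the output of a corrector such as the hypothetical AttentionCorrector above. The shapes, the boolean image-token mask, and the omission of row-level renormalization are simplifying assumptions; real LVLM attention layers differ in detail.

    # Hypothetical inference-time substitution of corrected cross-modal attention.
    # Shapes and the corrector interface are assumptions; real LVLMs differ in detail.
    import torch

    def attend_with_correction(query, key, value, image_token_mask, corrector):
        # query, key, value: (batch, seq_len, dim); image_token_mask: bool of length seq_len.
        d = query.size(-1)
        scores = query @ key.transpose(-2, -1) / d ** 0.5   # (batch, seq_len, seq_len)
        attn = torch.softmax(scores, dim=-1)

        # Slice out the cross-modal part: attention paid to the image tokens.
        cross = attn[..., image_token_mask]                  # (batch, seq_len, n_img)
        corrected = corrector(cross)                         # same shape, renormalized

        # Substitute the corrected pattern back into the attention map.
        # (Full rows are not renormalized here, for simplicity.)
        attn = attn.clone()
        attn[..., image_token_mask] = corrected
        return attn @ value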

Entities

Institutions

  • arXiv

Sources