ARTFEED — Contemporary Art Intelligence

Visual Attention Structure Reveals Hallucination in Multimodal LLMs

ai-technology · 2026-05-13

Researchers have proposed a technique for identifying visual hallucinations in multimodal large language models (MLLMs) by examining the high-frequency structure of visual attention. The study, posted on arXiv, reports that layer-wise Laplacian energy reveals the layers where hallucinations originate and the layers where ground-truth answers transiently recover. The authors introduce LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding strategy that uses Laplacian energy to select informative layers and remaps next-token logits in closed form. The approach addresses a persistent failure mode: models can assign substantial attention to image tokens and still drift toward incorrect answers. The paper is available at arXiv:2605.11559.
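
To make the idea of "layer-wise Laplacian energy of visual attention" concrete, here is a minimal Python sketch. It is an illustration only: the patch-grid layout, the 4-neighbour Laplacian stencil, the normalization, and the top-k layer-selection rule are assumptions for exposition, not the paper's actual formulation, and the function names (laplacian_energy, select_layers) are hypothetical.

```python
# Illustrative sketch: scoring the high-frequency content of a layer's visual
# attention with a discrete Laplacian. Grid shape, stencil, normalization,
# and the top-k selection rule below are assumptions, not LaSCD's definition.
import numpy as np

def laplacian_energy(attn_map: np.ndarray) -> float:
    """Score high-frequency structure of a 2D attention map over image patches.

    attn_map: (H, W) attention mass the generated token places on each image
    patch, reshaped to the patch grid. A 4-neighbour discrete Laplacian
    responds to sharp local changes; summing its squared response gives a
    scalar "energy" that is large when attention is spatially jagged.
    """
    a = attn_map / (attn_map.sum() + 1e-12)                  # normalize to a distribution
    lap = -4.0 * a                                           # centre term of the stencil
    lap += np.roll(a, 1, axis=0) + np.roll(a, -1, axis=0)    # vertical neighbours
    lap += np.roll(a, 1, axis=1) + np.roll(a, -1, axis=1)    # horizontal neighbours
    # np.roll wraps around at the borders; a real implementation would handle
    # boundaries explicitly. Kept simple here for illustration.
    return float(np.sum(lap ** 2))

def select_layers(attn_per_layer: list[np.ndarray], k: int = 4) -> list[int]:
    """Profile every decoder layer and keep the k highest-energy ones."""
    energies = [laplacian_energy(a) for a in attn_per_layer]
    return sorted(np.argsort(energies)[-k:].tolist())
```

In this toy version, a layer whose attention over image patches is spatially jagged scores higher than one with smooth, diffuse attention, which is one plausible reading of "high-frequency structure" at the layer level.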

Key facts

  • Multimodal large language models (MLLMs) are vulnerable to visual hallucinations.
  • Hallucination can occur even when models assign substantial attention to image tokens.
  • High-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals hallucination layers.
  • LaSCD (Laplacian-Spectral Contrastive Decoding) is a training-free decoding strategy.
  • LaSCD selects informative layers via Laplacian energy and remaps next-token logits in closed form (a generic contrastive remap is sketched after this list).
  • The paper is published on arXiv with identifier 2605.11559.
  • The study focuses on visual reasoning and grounded question answering.
  • The method can detect where ground-truth answers transiently recover.
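
As referenced above, the closed-form logit remap can be pictured with a standard contrastive-decoding combination. The sketch below shows the generic form only; the specific weighting, the choice of baseline distribution, and the alpha value are assumptions, not LaSCD's actual expression.

```python
# Illustrative sketch: a generic contrastive remap of next-token logits,
# in the spirit of "amplify the informative layers, penalize the rest".
# The weighting scheme and alpha are assumptions, not the paper's formula.
import numpy as np

def contrastive_logits(logits_informative: np.ndarray,
                       logits_baseline: np.ndarray,
                       alpha: float = 1.0) -> np.ndarray:
    """Combine logits from selected (informative) layers with baseline logits.

    A common contrastive-decoding form: boost tokens favoured by the
    informative layers and suppress tokens favoured by the baseline,
    hallucination-prone distribution. The result can be softmaxed and
    sampled as usual.
    """
    return (1.0 + alpha) * logits_informative - alpha * logits_baseline
```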

Entities

Institutions

  • arXiv

Sources