Architectural Factors Behind LVLM Hallucination Robustness

ai-technology · 2026-06-01

A new study posted on arXiv (2605.30911) investigates how the architecture of Large Vision-Language Models (LVLMs) affects hallucination. The authors decompose architecture into three dimensions: Linguistic Foundation, Visual Representation, and Semantic Alignment. They categorize hallucinations into three types: Co-occurrence, Similarity, and Uncertainty. They also introduce the CoSimUE benchmark, which uses controlled textual and random perturbations to create fine-grained hallucination scenarios. Experiments across seven design aspects show that scaling model parameters does not consistently reduce hallucinations.

Key facts

Hallucination undermines LVLM reliability.
Architecture design is a key factor in hallucination.
Three dimensions: Linguistic Foundation, Visual Representation, Semantic Alignment.
Three hallucination types: Co-occurrence, Similarity, Uncertainty.
CoSimUE benchmark creates fine-grained scenarios via perturbations.
Experiments cover seven design aspects.
Parameter scaling does not consistently reduce hallucinations.
Study published on arXiv (2605.30911).
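
This summary does not describe how CoSimUE scores model outputs, so the following is only a minimal sketch of how hallucination rates could be tallied per type. The three type names come from the study; the function name `hallucination_rates`, the record layout, and the sample judgments are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hallucination types named in the study; everything else here is hypothetical.
TYPES = ("co-occurrence", "similarity", "uncertainty")


def hallucination_rates(records):
    """Return the fraction of hallucinated answers per hallucination type.

    Each record is a (type, hallucinated) pair, where hallucinated is True
    when the model asserted something the image does not support.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for h_type, hallucinated in records:
        if h_type not in TYPES:
            raise ValueError(f"unknown hallucination type: {h_type}")
        totals[h_type] += 1
        errors[h_type] += int(hallucinated)
    # Skip types with no records to avoid dividing by zero.
    return {t: errors[t] / totals[t] for t in TYPES if totals[t]}


# Made-up judgments for illustration only; not results from the paper.
sample = [
    ("co-occurrence", True),
    ("co-occurrence", False),
    ("similarity", False),
    ("uncertainty", True),
]
print(hallucination_rates(sample))
```

Reporting a separate rate for each type, rather than one pooled score, is what would let a comparison across model sizes show whether scaling helps one type but not another.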
