Architectural Factors Behind LVLM Hallucination Robustness
A new study posted on arXiv (2605.30911) investigates how the architecture design of Large Vision-Language Models (LVLMs) affects hallucination. The authors decompose architecture into three dimensions (Linguistic Foundation, Visual Representation, and Semantic Alignment) and categorize hallucinations into three types (Co-occurrence, Similarity, and Uncertainty). They introduce the CoSimUE benchmark, which uses controlled textual and random perturbations to create fine-grained hallucination scenarios. Experiments across seven design aspects show that scaling model parameters does not consistently reduce hallucinations.
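To make the three-way taxonomy concrete, here is a minimal sketch of how per-type hallucination rates could be tallied from evaluation records. The record format, the `hallucination_rates` helper, and the sample data are all assumptions made for illustration; the summary does not describe how CoSimUE actually scores models.

```python
from collections import defaultdict
from enum import Enum


class HallucinationType(Enum):
    """The three hallucination types named in the study."""
    CO_OCCURRENCE = "co-occurrence"
    SIMILARITY = "similarity"
    UNCERTAINTY = "uncertainty"


def hallucination_rates(records):
    """Return the fraction of hallucinated answers per type.

    `records` is a hypothetical iterable of
    (HallucinationType, hallucinated: bool) pairs.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for h_type, hallucinated in records:
        totals[h_type] += 1
        errors[h_type] += int(hallucinated)
    return {h_type: errors[h_type] / totals[h_type] for h_type in totals}


# Made-up records for illustration only; not results from the paper.
sample = [
    (HallucinationType.CO_OCCURRENCE, True),
    (HallucinationType.CO_OCCURRENCE, False),
    (HallucinationType.SIMILARITY, False),
    (HallucinationType.UNCERTAINTY, True),
]
rates = hallucination_rates(sample)
```

Reporting rates separately for each type, rather than as one aggregate score, is what would let a comparison across model sizes show whether scaling helps on one type but not another.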
Key facts
- Hallucination undermines LVLM reliability.
- The authors treat architecture design as a key factor in hallucination.
- Three architectural dimensions: Linguistic Foundation, Visual Representation, Semantic Alignment.
- Three hallucination types: Co-occurrence, Similarity, Uncertainty.
- CoSimUE benchmark creates fine-grained hallucination scenarios via controlled textual and random perturbations.
- Experiments cover seven design aspects.
- Parameter scaling does not consistently reduce hallucinations.
- Study posted on arXiv (2605.30911).
Entities
Institutions
- arXiv