HalluScope Benchmark Reveals Textual Priors as Main Cause of LVLM Hallucinations
A new arXiv preprint (2604.21911v1) introduces HalluScope, a benchmark designed to identify the primary causes of hallucinations in large vision-language models (LVLMs). The analysis finds that hallucinations are largely driven by excessive reliance on textual priors and background knowledge, particularly priors introduced through textual instructions, rather than by limitations of the vision backbone or dominance of the language component. To address this, the authors propose HalluVL-DPO, a fine-tuning framework that steers off-the-shelf LVLMs toward more visually grounded responses via preference optimization. The work provides a systematic analysis of hallucination factors and a concrete mitigation strategy.
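The preprint's exact HalluVL-DPO recipe is not reproduced here, but the preference-optimization component it builds on is standard Direct Preference Optimization (DPO). As a minimal sketch, assuming per-response log-probabilities have already been computed under the trainable policy and a frozen reference copy of the LVLM, the core loss looks like this; variable names are illustrative, not the authors' code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of summed token log-probabilities for a
    batch of responses: `chosen` = the visually grounded response,
    `rejected` = the hallucinated one, scored under the trainable
    policy and a frozen reference model respectively.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Maximize the policy's preference margin relative to the reference;
    # beta controls how far the policy may drift from the reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

In HalluVL-DPO's setting, the chosen/rejected pairs would presumably contrast visually grounded and hallucination-prone answers, though how those pairs are constructed is not specified in this summary.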
Key facts
- HalluScope benchmark proposed to understand factors inducing hallucinations in LVLMs
- Hallucinations stem from excessive reliance on textual priors and background knowledge (a simple probe for this is sketched after this list)
- Textual instructions are a key source of hallucination-inducing priors
- HalluVL-DPO framework fine-tunes LVLMs for visually grounded responses
- HalluVL-DPO leverages preference optimization
- Study published on arXiv with identifier 2604.21911v1
- Research resolves ambiguity about relative importance of hallucination factors
- Prior work attributed hallucinations to vision backbone limitations or language dominance
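The benchmark's diagnostic protocol is not described in this summary, but one common way to probe whether a response is driven by textual priors rather than the image is to score the same answer with the real image and with an uninformative one: a near-zero gap suggests text-prior reliance. The sketch below assumes a HuggingFace-style causal LVLM forward signature (`input_ids`, `pixel_values`), which varies across model families; it is an illustration, not HalluScope's method.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def response_logprob(model, prompt_ids, response_ids, pixel_values):
    """Summed log-probability of `response_ids` given the prompt and image.
    Assumes a causal LVLM whose forward accepts `pixel_values`; exact
    argument names differ between model families (an assumption here)."""
    ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids=ids, pixel_values=pixel_values).logits
    # The logit at position t predicts the token at position t + 1.
    resp_logits = logits[:, prompt_ids.shape[-1] - 1 : -1, :]
    logps = F.log_softmax(resp_logits, dim=-1)
    return logps.gather(-1, response_ids.unsqueeze(-1)).sum()

@torch.no_grad()
def textual_prior_gap(model, prompt_ids, response_ids, image, blank_image):
    """Score the same response with the real image vs. a blank/noise image.
    A small gap means the image contributes little evidence, i.e. the
    response is plausibly driven by textual priors."""
    return (response_logprob(model, prompt_ids, response_ids, image)
            - response_logprob(model, prompt_ids, response_ids, blank_image))
```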