VisualNeedle Benchmark Exposes MLLM Visual Search Shortcuts
Researchers have introduced VisualNeedle, a benchmark for testing active visual search in multimodal large language models (MLLMs). It targets information-dense scenes where the critical evidence is spatially constrained, so that models must rely on genuine visual processing rather than shortcuts. Prior studies identified three such shortcuts: linguistic priors and lexical cues in the question can give away the answer; coarse global semantics from the visual encoder can stand in for fine-grained details; and, in some benchmarks, corrupting intermediate images barely affects the answers. Because higher input resolution or larger question pools do not by themselves elicit genuine active visual search, VisualNeedle is designed to require fine-grained perception. The work is published on arXiv under identifier 2605.26380.
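The shortcuts described above can be probed with simple ablations. The following is a minimal sketch, not the VisualNeedle evaluation protocol: it assumes each question has already been answered three times (with the intact image, with no image, and with a corrupted image) and compares the resulting accuracies. The `Record` fields and the `shortcut_report` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    """One benchmark question with answers under three conditions (hypothetical schema)."""
    gold: str       # reference answer
    full: str       # model answer with the intact image
    blind: str      # model answer with no image (text-only prompt)
    corrupted: str  # model answer with a corrupted image


def shortcut_report(records: List[Record]) -> Dict[str, float]:
    """Summarize how much a model's accuracy depends on the image."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    full_acc = sum(r.full == r.gold for r in records) / n
    blind_acc = sum(r.blind == r.gold for r in records) / n
    corrupted_acc = sum(r.corrupted == r.gold for r in records) / n
    # Fraction of answers that stay the same after the image is corrupted.
    unchanged = sum(r.full == r.corrupted for r in records) / n
    return {
        "full_acc": full_acc,
        "blind_acc": blind_acc,
        "corrupted_acc": corrupted_acc,
        # A small gap suggests the question text alone carries the answer.
        "blind_gap": full_acc - blind_acc,
        "unchanged_under_corruption": unchanged,
    }


if __name__ == "__main__":
    demo = [
        Record(gold="A", full="A", blind="A", corrupted="A"),
        Record(gold="B", full="B", blind="C", corrupted="B"),
    ]
    print(shortcut_report(demo))
```

Under these assumptions, a high `blind_acc` points to linguistic priors, and a high `unchanged_under_corruption` points to answers that do not depend on the visual evidence. Exact string matching is a simplification; a real evaluation would need answer normalization.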
Key facts
- VisualNeedle is a benchmark for active visual search in information-dense scenes.
- Frontier MLLMs achieve over 90% accuracy on existing fine-grained perception benchmarks.
- Three shortcuts inflate benchmark performance: linguistic priors and lexical cues in questions, coarse global semantics from visual encoders, and answers that barely change when intermediate images are corrupted.
- Higher input resolution or larger question pools do not elicit genuine active visual search.
- The benchmark is introduced to address the lack of faithful use of visual evidence.
- The paper is available on arXiv with ID 2605.26380.
- The research focuses on spatially constrained critical evidence.
- The benchmark is designed to be challenging and fine-grained.
Entities
Institutions
- arXiv