VisualNeedle Benchmark Exposes MLLM Visual Search Shortcuts

ai-technology · 2026-05-27

Researchers have introduced VisualNeedle, a benchmark that tests active visual search in multimodal large language models (MLLMs). It targets information-dense scenes in which the critical evidence is spatially constrained, so that models must rely on genuine visual processing rather than shortcuts. Prior studies identified three common shortcuts: linguistic priors and lexical cues in the question can give away the answer; coarse global semantics from the visual encoder can stand in for fine-grained detail; and, on some benchmarks, corrupting intermediate images barely changes the answers. VisualNeedle is built to require fine-grained perception, something that higher input resolution or larger question pools alone do not secure. The paper is available on arXiv under identifier 2605.26380.
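The shortcuts described above suggest a simple diagnostic: score a model with the image, without it, and with a corrupted copy, then compare. The following is a minimal sketch of that idea, not the procedure used in the paper. The item format, the model callable, the corrupt function and the name shortcut_report are all hypothetical, and the choice of which image to corrupt (the input or an intermediate crop) is left to the caller.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical item format: (image, question, gold_answer).
Item = Tuple[Optional[object], str, str]
# Hypothetical model interface: (image or None, question) -> answer string.
Model = Callable[[Optional[object], str], str]


def accuracy(model: Model, items: Sequence[Item]) -> float:
    """Fraction of items answered with an exact, case-insensitive match."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(
        model(image, question).strip().lower() == gold.strip().lower()
        for image, question, gold in items
    )
    return correct / len(items)


def shortcut_report(
    model: Model,
    items: Sequence[Item],
    corrupt: Callable[[object], object],
) -> Dict[str, float]:
    """Compare accuracy with the image, without it, and with a corrupted copy."""
    full = accuracy(model, items)
    # Blind condition: the model sees only the question text.
    blind = accuracy(model, [(None, q, a) for _, q, a in items])
    # Corrupted condition: the visual evidence is replaced by a damaged copy.
    corrupted = accuracy(model, [(corrupt(img), q, a) for img, q, a in items])
    return {
        "full": full,
        "blind": blind,
        "corrupted": corrupted,
        "blind_gap": full - blind,            # near 0: the text alone suffices
        "corruption_gap": full - corrupted,   # near 0: the image is barely used
    }
```

Under these assumptions, gaps close to zero indicate that the model's score does not depend on the visual evidence, which is the failure the benchmark is meant to expose.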

Key facts

VisualNeedle is a benchmark for active visual search in information-dense scenes.
Frontier MLLMs achieve over 90% accuracy on existing fine-grained perception benchmarks.
Three shortcuts inflate benchmark performance: linguistic priors, coarse global semantics, and answers that barely change when intermediate images are corrupted.
Higher input resolution or larger question pools do not elicit genuine active visual search.
The benchmark was introduced because models often do not make faithful use of visual evidence.
The paper is available on arXiv with ID 2605.26380.
The research focuses on spatially constrained critical evidence.
The benchmark is designed to be challenging and fine-grained.
