Study Questions Whether Vision-Language Benchmarks Truly Test Visual Understanding

ai-technology · 2026-05-25

A recent study published on arXiv (2605.22903) challenges the assumption that high benchmark accuracy in vision-language models (VLMs) reflects genuine visual comprehension. The researchers found that performance on a prominent hallucination benchmark dropped only slightly when a large share of image tokens was removed. Their analysis covers global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level evaluation. A layer-wise examination of vision-token geometry supports the behavioral findings. The results indicate that VLMs do use visual input, but they are less sensitive to the loss of detailed visual information than accuracy metrics imply.

Key facts

Study published on arXiv with ID 2605.22903
Focuses on vision-language models (VLMs)
Removing many image tokens barely affects benchmark scores
Analysis includes global visual degradation and localized occlusion
Examines question reformulation and answer-space expansion
Layer-wise analysis of vision-token geometry conducted
VLMs still use visual input but are less sensitive to missing visual detail than accuracy suggests
Benchmark accuracy may overstate visual grounding
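
To make the token-removal test concrete, here is a minimal Python sketch of that kind of ablation. It is an illustration under stated assumptions, not the study's code: the function names, the random dropping strategy, and the `predict` callable (standing in for a VLM) are all hypothetical, and the paper may select and remove tokens differently.

```python
import random
from typing import Callable, List, Sequence, Tuple


def drop_image_tokens(tokens: Sequence, drop_fraction: float, seed: int = 0) -> list:
    """Randomly remove a fraction of image tokens, keeping survivors in order.

    Hypothetical helper: uniform random dropping is an assumption made for
    illustration, not necessarily the strategy used in the study.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be in [0, 1]")
    n_keep = len(tokens) - int(round(len(tokens) * drop_fraction))
    rng = random.Random(seed)
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx]


def accuracy_under_ablation(
    predict: Callable[[list, str], str],
    examples: List[Tuple[Sequence, str, str]],
    drop_fraction: float,
    seed: int = 0,
) -> float:
    """Score a model on (image_tokens, question, answer) triples after ablation.

    `predict` is a stand-in for a real VLM call (an assumption of this sketch).
    """
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    for i, (tokens, question, answer) in enumerate(examples):
        kept = drop_image_tokens(tokens, drop_fraction, seed + i)
        if predict(kept, question) == answer:
            correct += 1
    return correct / len(examples)
```

Comparing `accuracy_under_ablation` at a drop fraction of 0.0 with the same call at a high drop fraction gives a rough measure of how much a benchmark score depends on the image. A model that answers from the question alone would score the same at both settings.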
