ARTFEED — Contemporary Art Intelligence

Study Questions Whether Vision-Language Benchmarks Truly Test Visual Understanding

ai-technology · 2026-05-25

A recent study published on arXiv (2605.22903) challenges the assumption that high benchmark accuracy in vision-language models (VLMs) reflects genuine visual comprehension. The researchers found that performance on a prominent hallucination benchmark was only minimally affected when a significant portion of image tokens was removed. Their analysis covers global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level evaluation. A layer-wise examination of vision-token geometry supports the behavioral findings. The results indicate that VLMs do use visual input, but are less sensitive to the loss of detailed visual information than accuracy metrics imply.
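The token-removal test described above can be illustrated with a small sketch. This is not the study's code or protocol: the function names (`drop_tokens`, `accuracy_under_ablation`, `prior_only_model`), the random-drop strategy, and the toy data are all assumptions chosen for illustration. The stand-in "model" ignores the image entirely, which shows how a benchmark score can stay flat under heavy visual degradation.

```python
import random


def drop_tokens(tokens, drop_fraction, rng):
    """Return a copy of `tokens` with a random `drop_fraction` of entries removed.

    At least one token is always kept so the model still receives some input.
    """
    keep_count = max(1, round(len(tokens) * (1.0 - drop_fraction)))
    keep_idx = sorted(rng.sample(range(len(tokens)), keep_count))
    return [tokens[i] for i in keep_idx]


def accuracy_under_ablation(model, examples, drop_fraction, seed=0):
    """Accuracy of `model` when each example's image tokens are randomly thinned."""
    rng = random.Random(seed)
    correct = 0
    for image_tokens, question, answer in examples:
        kept = drop_tokens(image_tokens, drop_fraction, rng)
        if model(kept, question) == answer:
            correct += 1
    return correct / len(examples)


def prior_only_model(image_tokens, question):
    """Hypothetical stand-in for a VLM that answers from a language prior alone."""
    return "yes"


# Toy yes/no benchmark (hypothetical): (image tokens, question, gold answer).
examples = [
    ([0.1] * 16, "Is there a dog in the image?", "yes"),
    ([0.2] * 16, "Is there a chair in the image?", "yes"),
    ([0.3] * 16, "Is there a car in the image?", "yes"),
    ([0.4] * 16, "Is there a boat in the image?", "no"),
]

for fraction in (0.0, 0.5, 0.9):
    score = accuracy_under_ablation(prior_only_model, examples, fraction)
    print(f"drop {fraction:.0%} of image tokens -> accuracy {score:.2f}")
```

Because the toy model never looks at its image tokens, its score is identical at every drop fraction by construction. A real evaluation would replace `prior_only_model` with an actual VLM call and compare the resulting curve against this flat baseline.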

Key facts

  • Study published on arXiv with ID 2605.22903
  • Focuses on vision-language models (VLMs)
  • Removing a significant portion of image tokens only minimally affects scores on a prominent hallucination benchmark
  • Analysis includes global and localized visual degradation
  • Examines question reformulation and answer-space expansion
  • Layer-wise analysis of vision-token geometry supports the behavioral findings
  • VLMs do use visual input, but are less sensitive to missing visual detail than accuracy metrics imply
  • Benchmark accuracy may overstate visual grounding

Entities

Institutions

  • arXiv
