ARTFEED — Contemporary Art Intelligence

Counterfactual Semantic Saliency Reveals VLM-Human Scene Perception Gap

ai-technology · 2026-05-14

A new black-box framework, Counterfactual Semantic Saliency (CSS), quantifies how vision-language models (VLMs) differ from humans in scene understanding. The method measures an object's importance by causally removing it from a scene and tracking the resulting semantic shift. Testing on 307 natural scenes with 1,306 counterfactual variants and 16,289 human responses, researchers found that, relative to humans, VLMs over-rely on objects that are large, centrally placed, or highly salient. The study highlights a pervasive comprehension gap in AI-human semantic alignment.

Key facts

  • Counterfactual Semantic Saliency (CSS) is a black-box, model-agnostic framework.
  • CSS measures object importance via causal ablation and semantic shift.
  • Tested on 307 complex natural scenes and 1,306 counterfactual variants.
  • 16,289 valid human responses formed the psychophysics baseline.
  • VLMs show size bias: overreliance on large objects relative to humans.
  • VLMs show center bias: overreliance on objects at image center.
  • VLMs over-rely on high-saliency objects.
  • The study reveals a pervasive AI-human scene comprehension gap.
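The core CSS measurement described above can be sketched as follows. The idea is to compare the embedding of a scene description before and after an object is ablated; a larger semantic shift implies the object mattered more to the model's understanding. This is a minimal illustration, not the paper's implementation: the embedding source, the distance metric (cosine distance here), and the function names are all assumptions.

```python
import numpy as np

def semantic_shift(orig_embedding, ablated_embedding):
    """Cosine distance between scene-description embeddings before and
    after an object is removed. Higher shift = more important object.
    (Hypothetical sketch; the paper's exact metric may differ.)"""
    a = np.asarray(orig_embedding, dtype=float)
    b = np.asarray(ablated_embedding, dtype=float)
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos

# Toy vectors standing in for real embeddings:
v = np.array([1.0, 0.0, 0.0])
w = np.array([0.0, 1.0, 0.0])
print(semantic_shift(v, v))  # identical descriptions -> 0.0
print(semantic_shift(v, w))  # orthogonal descriptions -> 1.0
```

In practice the two embeddings would come from captioning (or otherwise describing) the original scene and its counterfactual variant with the same VLM, so the shift is measured entirely from model outputs, keeping the framework black-box and model-agnostic as the study describes.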
