Counterfactual Semantic Salience Reveals VLM-Human Scene Perception Gap
A new black-box framework, Counterfactual Semantic Saliency (CSS), quantifies how vision-language models (VLMs) differ from humans in scene understanding. The method measures an object's importance by causally removing it from a scene and tracking the resulting semantic shift. Testing on 307 natural scenes with 1,306 counterfactual variants and 16,289 human responses, the researchers found that, compared to humans, VLMs over-rely on large, centrally placed, and highly salient objects, highlighting a pervasive comprehension gap in AI-human semantic alignment.
Key facts
- Counterfactual Semantic Saliency (CSS) is a black-box, model-agnostic framework.
- CSS measures object importance via causal ablation and semantic shift.
- Tested on 307 complex natural scenes and 1,306 counterfactual variants.
- 16,289 valid human responses formed the psychophysics baseline.
- VLMs show a size bias: over-reliance on large objects relative to humans.
- VLMs show a center bias: over-reliance on objects near the image center.
- VLMs show a saliency bias: over-reliance on highly salient objects.
- The study reveals a pervasive AI-human scene comprehension gap.