Counterfactual Semantic Salience Reveals VLM-Human Scene Perception Gap
A new black-box framework, Counterfactual Semantic Saliency (CSS), quantifies how vision-language models (VLMs) differ from humans in scene understanding. The method measures an object's importance by causally removing it from a scene and tracking the resulting semantic shift. Testing on 307 natural scenes with 1,306 counterfactual variants and 16,289 human responses, the researchers found that, compared to humans, VLMs over-rely on large, centrally placed, and highly salient objects, highlighting a pervasive comprehension gap in AI-human semantic alignment.
Key facts
- Counterfactual Semantic Saliency (CSS) is a black-box, model-agnostic framework.
- CSS measures object importance via causal ablation and semantic shift.
- Tested on 307 complex natural scenes and 1,306 counterfactual variants.
- 16,289 valid human responses formed the psychophysics baseline.
- VLMs show a size bias: over-reliance on large objects relative to humans.
- VLMs show a center bias: over-reliance on objects near the image center.
- VLMs show a saliency bias: over-reliance on highly salient objects.
- The study reveals a pervasive AI-human scene comprehension gap.