SpatialUncertain: VLMs Fail to Recognize Unanswerable Spatial Questions

ai-technology · 2026-06-01

A new paper posted to arXiv introduces SpatialUncertain, a framework that tests whether vision-language models (VLMs) know when not to answer a spatial question. The study identifies two observation challenges: occlusion, which hides the target information, and perspective ambiguity, which produces misleading visual cues. Existing benchmarks assume that the observation is sufficient to answer the question, so they measure whether a model answers correctly rather than whether it recognizes that a question is unanswerable. The paper argues that a visual observation is an inherently limited representation of a 3D world, and that occlusion and perspective can therefore mislead. SpatialUncertain designs spatial questions that are answerable under clean conditions but become unanswerable once either challenge is applied. The work highlights a gap in VLM spatial reasoning: models often fail to acknowledge uncertainty and cannot identify what additional observations they would need. The findings matter for deploying VLMs in real-world environments, where visual data is often incomplete or ambiguous.
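To make the clean-versus-challenged design concrete, here is a minimal Python sketch of how paired responses might be scored. It is an illustration only, not the paper's evaluation code: the `PairedItem` fields, the `ABSTAIN_MARKERS` phrase list, the substring-matching rule, and the example responses are all assumptions made for this sketch.

```python
from dataclasses import dataclass

# Hypothetical phrases treated as an abstention; not taken from the paper.
ABSTAIN_MARKERS = (
    "cannot be determined",
    "cannot determine",
    "not enough information",
    "unanswerable",
)


@dataclass
class PairedItem:
    """One question seen under a clean view and under a degraded view."""
    question: str
    clean_answer: str         # ground truth when the view is unobstructed
    challenge: str            # "occlusion" or "perspective_ambiguity"
    response_clean: str       # model output on the clean view
    response_challenged: str  # model output on the degraded view


def is_abstention(response: str) -> bool:
    """Return True if the response contains any assumed abstention phrase."""
    text = response.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def score(items: list[PairedItem]) -> dict[str, float]:
    """Compute clean-view accuracy and challenged-view abstention rate."""
    if not items:
        raise ValueError("items must not be empty")
    n = len(items)
    # Crude correctness check: the ground-truth string appears in the response.
    clean_correct = sum(
        item.clean_answer.lower() in item.response_clean.lower() for item in items
    )
    abstained = sum(is_abstention(item.response_challenged) for item in items)
    return {
        "clean_accuracy": clean_correct / n,
        "abstention_rate": abstained / n,
    }


# Made-up responses for illustration; these are not results from the paper.
example_items = [
    PairedItem(
        question="Is the mug left or right of the laptop?",
        clean_answer="left",
        challenge="occlusion",
        response_clean="The mug is to the left of the laptop.",
        response_challenged="The mug is to the left of the laptop.",
    ),
    PairedItem(
        question="Is the ball in front of or behind the box?",
        clean_answer="behind",
        challenge="perspective_ambiguity",
        response_clean="The ball is behind the box.",
        response_challenged="This cannot be determined from this viewpoint.",
    ),
]

result = score(example_items)
```

A model that answers well on clean views but keeps answering confidently on degraded ones would show high `clean_accuracy` and low `abstention_rate`, which is the failure pattern the article describes. Phrase matching is a rough proxy; a real evaluation would need a more robust way to detect abstention.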

Key facts

arXiv paper ID: 2605.30557v1
SpatialUncertain framework introduced
Two observation challenges: occlusion and perspective ambiguity
Existing benchmarks assume observations are sufficient
VLMs fail to recognize when spatial questions cannot be answered
Visual observations are limited representations of a 3D world
Occlusion hides target information
Perspective ambiguity produces misleading visual cues
