SpatialUncertain: VLMs Fail to Recognize Unanswerable Spatial Questions
A new paper on arXiv introduces SpatialUncertain, a framework that tests whether vision-language models (VLMs) recognize when a spatial question cannot be answered from the available view. The study identifies two observation challenges: occlusion, which hides target information, and perspective ambiguity, which produces misleading visual cues. Existing benchmarks assume that observations are sufficient, so they measure whether answers are correct rather than whether a model recognizes an unanswerable question. The paper argues that visual observations are inherently limited representations of a 3D world. SpatialUncertain therefore designs spatial questions that are answerable under clean conditions but become unanswerable once occlusion or perspective ambiguity is introduced. The work highlights a critical gap in VLM spatial reasoning: models often fail to acknowledge uncertainty and cannot identify what additional observations they would need. The findings matter for deploying VLMs in real-world environments, where visual data is incomplete or ambiguous.
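To make the evaluation idea concrete, here is a minimal Python sketch of how paired answerable and unanswerable questions might be scored. It is an illustration only, not the paper's actual protocol or metrics, which this summary does not describe. The `Item` class, the `score` function, the `ABSTAIN` label, and the sample questions are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical abstention label; the paper's actual answer format is not given here.
ABSTAIN = "cannot determine"


@dataclass(frozen=True)
class Item:
    question: str
    condition: str  # assumed labels: "clean", "occlusion", or "perspective"
    gold: str       # gold answer; ABSTAIN when the view is insufficient


def score(items, predictions):
    """Return (accuracy on answerable items, abstention rate on unanswerable items)."""
    ans_total = ans_correct = unans_total = unans_abstained = 0
    for item, pred in zip(items, predictions):
        pred = pred.strip().lower()
        if item.gold == ABSTAIN:
            # Unanswerable: the only correct behavior is to abstain.
            unans_total += 1
            unans_abstained += pred == ABSTAIN
        else:
            # Answerable: compare against the gold answer.
            ans_total += 1
            ans_correct += pred == item.gold.lower()
    accuracy = ans_correct / ans_total if ans_total else 0.0
    abstention_rate = unans_abstained / unans_total if unans_total else 0.0
    return accuracy, abstention_rate


# The same question under a clean view and under the two challenges (made-up data).
items = [
    Item("Is the mug left of the laptop?", "clean", "yes"),
    Item("Is the mug left of the laptop?", "occlusion", ABSTAIN),
    Item("Is the mug left of the laptop?", "perspective", ABSTAIN),
]
predictions = ["yes", "yes", "cannot determine"]

accuracy, abstention_rate = score(items, predictions)
print(accuracy, abstention_rate)
```

Reporting the two numbers separately keeps a model that always answers from looking strong: it can score well on the clean items while its abstention rate on the occluded and ambiguous items stays low.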
Key facts
- arXiv paper ID: 2605.30557v1
- SpatialUncertain framework introduced
- Two observation challenges: occlusion and perspective ambiguity
- Existing benchmarks assume observations are sufficient
- VLMs fail to recognize when spatial questions cannot be answered
- Visual observations are limited representations of a 3D world
- Occlusion hides target information
- Perspective ambiguity produces misleading visual cues
Entities
Institutions
- arXiv