SpatialAct Benchmark Tests VLM Spatial Reasoning-to-Action Gap
Researchers have introduced SpatialAct, a simulator-grounded benchmark that evaluates action-conditioned spatial reasoning in vision-language models (VLMs) in three-dimensional environments. It tests whether VLMs can build a coherent spatial understanding, act on that understanding, and refine their actions through multi-turn feedback. The benchmark comprises a Multi-turn Interactive Refinement setting, a Single-step Error Detection and Fix task, and five fundamental spatial ability tasks designed to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: VLMs show promising performance on observation-conditioned spatial perception and reasoning, but struggle to translate that reasoning into effective actions. The research underscores the limitations of current VLMs in handling real-world spatial tasks.
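The multi-turn refinement idea can be illustrated with a toy loop: a policy proposes an action, a simulator scores it and returns feedback, and the policy adjusts on the next turn. The sketch below is not SpatialAct's interface. `ToySimulator`, `HalvingPolicy`, `Feedback`, `run_refinement`, the error-vector feedback format, and the distance tolerance are all hypothetical names and assumptions chosen for illustration, and the hand-written policy merely stands in for a VLM.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Position = Tuple[float, float, float]


@dataclass
class Feedback:
    """Hypothetical feedback record returned after each attempted action."""
    success: bool
    error: Position  # target minus attempted position, per axis


class ToySimulator:
    """Stand-in for a 3D simulator: checks a placement against a target."""

    def __init__(self, target: Position, tolerance: float) -> None:
        self.target = target
        self.tolerance = tolerance

    def step(self, action: Position) -> Feedback:
        error = tuple(t - a for t, a in zip(self.target, action))
        distance = sum(e * e for e in error) ** 0.5
        return Feedback(success=distance <= self.tolerance, error=error)


class HalvingPolicy:
    """Stand-in for a VLM: moves halfway toward the target each turn."""

    def __init__(self) -> None:
        self.last_action: Position = (0.0, 0.0, 0.0)

    def act(self, feedback: Optional[Feedback]) -> Position:
        if feedback is not None:
            # Correct the previous action by half of the reported error.
            self.last_action = tuple(
                a + 0.5 * e for a, e in zip(self.last_action, feedback.error)
            )
        return self.last_action


def run_refinement(policy: HalvingPolicy, sim: ToySimulator,
                   max_turns: int) -> Tuple[bool, int]:
    """Run the propose/feedback loop; return (succeeded, turns used)."""
    feedback: Optional[Feedback] = None
    for turn in range(1, max_turns + 1):
        action = policy.act(feedback)
        feedback = sim.step(action)
        if feedback.success:
            return True, turn
    return False, max_turns


if __name__ == "__main__":
    sim = ToySimulator(target=(1.0, 0.0, 0.0), tolerance=0.2)
    print(run_refinement(HalvingPolicy(), sim, max_turns=5))
```

In this toy setup, the number of turns needed to succeed gives a rough measure of how well feedback is being used; an actual benchmark would define its own actions, feedback, and success criteria.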
Key facts
- SpatialAct is a simulator-grounded benchmark for action-conditioned spatial reasoning in 3D scenes.
- The benchmark includes Multi-turn Interactive Refinement and Single-step Error Detection and Fix tasks.
- Five fundamental spatial ability tasks are used to diagnose underlying causes of model failures.
- Experiments reveal a clear reasoning-to-action gap in VLMs.
- VLMs show promising performance on observation-conditioned spatial perception and reasoning.
- The study questions whether VLMs can build coherent spatial understanding and act upon it.
- Multi-turn feedback is used to refine actions in the benchmark.
- The research was published on arXiv with identifier 2605.31148.
Entities
Institutions
- arXiv