SpatialAct Benchmark Tests VLM Spatial Reasoning-to-Action Gap

ai-technology · 2026-06-01

Researchers have introduced SpatialAct, a benchmark that evaluates action-conditioned spatial reasoning in vision-language models (VLMs) within three-dimensional environments. It tests whether a VLM can build a coherent spatial understanding, act on that understanding, and refine its actions through multi-turn feedback. The benchmark comprises a Multi-turn Interactive Refinement setting, a Single-step Error Detection and Fix task, and five fundamental spatial ability tasks designed to diagnose the causes of model failures. Experiments reveal a clear reasoning-to-action gap: VLMs perform promisingly on observation-conditioned spatial perception and reasoning but struggle to translate that reasoning into effective actions. The results underscore the limitations of current VLMs on real-world spatial tasks.

Key facts

SpatialAct is a simulator-grounded benchmark for action-conditioned spatial reasoning in 3D scenes.
The benchmark includes Multi-turn Interactive Refinement and Single-step Error Detection and Fix tasks.
Five fundamental spatial ability tasks are used to diagnose underlying causes of model failures.
Experiments reveal a clear reasoning-to-action gap in VLMs.
VLMs show promising performance on observation-conditioned spatial perception and reasoning.
The study asks whether VLMs can build a coherent spatial understanding and act upon it.
Multi-turn feedback is used to refine actions in the benchmark.
The research was published on arXiv with identifier 2605.31148.
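
The multi-turn refinement idea mentioned above can be illustrated with a minimal sketch. This is not SpatialAct's code or API: the names `Feedback`, `refine_loop`, `toy_simulator`, and `toy_policy` are hypothetical, the "scene" is a single integer position rather than a 3D environment, and a hand-written rule stands in for a VLM. It only shows the general shape of a propose, simulate, refine loop.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """Hypothetical simulator response: success flag plus a textual hint."""
    success: bool
    message: str


History = List[Tuple[int, Feedback]]


def refine_loop(
    propose: Callable[[History], int],
    simulate: Callable[[int], Feedback],
    max_turns: int = 5,
) -> Tuple[Optional[int], History]:
    """Run up to max_turns rounds of propose -> simulate -> feedback.

    Returns the 1-based turn on which the action succeeded (None if it
    never did) together with the full interaction history.
    """
    history: History = []
    for turn in range(1, max_turns + 1):
        action = propose(history)      # model picks an action given past feedback
        feedback = simulate(action)    # simulator grounds the action and responds
        history.append((action, feedback))
        if feedback.success:
            return turn, history
    return None, history


def toy_simulator(action: int, target: int = 3) -> Feedback:
    """Toy 1D stand-in for a 3D simulator: the goal is to reach `target`."""
    if action == target:
        return Feedback(True, "ok")
    return Feedback(False, "move right" if action < target else "move left")


def toy_policy(history: History) -> int:
    """Rule-based stand-in for a VLM: start at 0, then follow the last hint."""
    if not history:
        return 0
    last_action, feedback = history[-1]
    return last_action + 2 if feedback.message == "move right" else last_action - 1


if __name__ == "__main__":
    turns_used, history = refine_loop(toy_policy, toy_simulator, max_turns=5)
    print("solved on turn:", turns_used)
    for action, feedback in history:
        print(action, feedback.message)
```

In this toy setup, success depends on how well the policy converts each hint into a corrected action, which is the kind of reasoning-to-action step the benchmark is described as probing.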
