SpaceNum Framework Tests VLMs' Spatial Numerical Understanding

ai-technology · 2026-05-25

A new study introduces SpaceNum, a unified framework for evaluating whether Vision-Language Models (VLMs) genuinely ground their numerical outputs in spatial perception. The framework covers two settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. Two bidirectional tasks, Num2Space and Space2Num, assess how VLMs map between visual spatial structure and language-side numerical representations. Across both settings, models largely fail to ground numbers in spatial context, which raises concerns for embodied AI applications. The paper is available on arXiv under ID 2605.23898.

Key facts

SpaceNum is a unified framework for evaluating spatial numerical understanding in VLMs.
It covers two settings: dynamic transitions and static layouts.
The two bidirectional tasks are Num2Space and Space2Num.
VLMs largely fail to ground numbers in spatial perception.
The study is available on arXiv under ID 2605.23898.
The research revisits whether numerical outputs are genuinely grounded in spatial perception.
VLMs are increasingly used in embodied environments requiring numerical outputs.
The paper systematically studies current VLMs' understanding of numerical values in spatial settings.
