ARTFEED — Contemporary Art Intelligence

3D Primitives as Spatial Language for VLMs

other · 2026-05-14

A recent study finds that vision-language models (VLMs) can reconstruct 3D scenes from basic geometric primitives such as cubes, spheres, and cylinders by writing executable code, yet fail at simpler spatial questions about the same images. The researchers introduce SpatialBabel, a benchmark that evaluates fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages, and find that object-detection F1 scores can vary by as much as 5.7× between languages. They also propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that improves spatial understanding. These results expose a paradox in VLM spatial reasoning and position 3D primitives as a useful intermediate representation.
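To make the "reconstruction through executable code" idea concrete, here is a minimal sketch of what a primitive-based scene program might look like. The paper's actual scene-code languages are not detailed in this summary, so the `Primitive` class and `scene_to_code` serializer below are hypothetical illustrations of the general approach:

```python
from dataclasses import dataclass

@dataclass
class Primitive:
    shape: str        # "cube", "sphere", or "cylinder"
    position: tuple   # (x, y, z) center
    size: float       # edge length or radius

def scene_to_code(primitives):
    """Serialize a primitive list as executable Python, the kind of
    program a VLM might emit to reconstruct a scene it has observed."""
    lines = ["scene = []"]
    for p in primitives:
        lines.append(
            f"scene.append(Primitive({p.shape!r}, {p.position!r}, {p.size!r}))"
        )
    return "\n".join(lines)

# A toy two-object scene and its generated code.
demo = [Primitive("cube", (0.0, 0.0, 0.0), 1.0),
        Primitive("sphere", (2.0, 0.0, 0.0), 0.5)]
print(scene_to_code(demo))
```

The generated string is itself runnable, which is what makes code a checkable spatial representation: executing it rebuilds the scene object by object.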

Key facts

  • VLMs can generate code to reconstruct 3D scenes from primitives but fail at simpler spatial questions.
  • SpatialBabel benchmark evaluates fourteen VLMs on primitive-based 3D reconstruction.
  • Six scene-code languages are used for 3D primitive scenes.
  • Object-detection F1 varies by up to 5.7× across languages.
  • Code-CoT is a proposed training-free inference strategy.
  • 3D geometric primitives include cubes, spheres, cylinders.
  • The study is from arXiv:2605.12586.
  • The paradox highlights limitations in VLM spatial understanding.
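The 5.7× spread refers to object-detection F1 measured on the reconstructed scenes. As a hedged sketch of how such a score could diverge across scene-code languages, the function below computes F1 over predicted versus ground-truth primitive labels using a simple greedy match by shape name; the matching criterion and the sample outputs are illustrative, not the benchmark's actual protocol:

```python
def f1_score(predicted, ground_truth):
    """F1 over predicted vs. ground-truth primitives, greedily
    matching by shape name (illustrative criterion only)."""
    remaining = list(ground_truth)
    tp = 0
    for shape in predicted:
        if shape in remaining:   # count each ground-truth object once
            remaining.remove(shape)
            tp += 1
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical outputs for the same image in two scene-code languages:
truth  = ["cube", "cube", "sphere", "cylinder"]
lang_a = ["cube", "cube", "sphere", "cylinder"]  # faithful reconstruction
lang_b = ["cube"]                                # degraded reconstruction
print(f1_score(lang_a, truth), f1_score(lang_b, truth))
```

Even this toy pair shows a multiple-fold F1 gap from the output language alone, which is the kind of sensitivity the benchmark quantifies.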

Entities

Institutions

  • arXiv

Sources