SpatialForge: AI Pipeline for 3D Spatial Reasoning from 2D Images

ai-technology · 2026-05-13

SpatialForge is a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision for Large Vision-Language Models (VLMs). Current VLMs excel at semantic understanding but struggle with geometric tasks such as depth ordering and coordinate grounding. Existing spatial supervision relies on scene-centric datasets (multi-view scans, indoor video), which are limited in scale and diversity compared to web-scale 2D images. SpatialForge decomposes spatial reasoning into perception and relation, and constructs structured supervision signals for depth, layout, and viewpoint-dependent reasoning, with automatic verification. By drawing on abundant 2D imagery, the approach addresses the bottleneck of scarce 3D training data. The paper is available on arXiv (2605.11462).
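To make the perception/relation split concrete, here is a minimal Python sketch of how a depth-ordering question might be synthesized and automatically filtered. It is an illustration only, not the paper's method: the `PerceivedObject` type, the `depth_order_sample` and `build_samples` functions, the meter unit, and the 0.5 m ambiguity threshold are all assumptions introduced here.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional


@dataclass(frozen=True)
class PerceivedObject:
    """Hypothetical perception output: a labeled object with an estimated depth."""
    label: str
    depth_m: float  # estimated distance from the camera, in meters (assumed unit)


def depth_order_sample(a: PerceivedObject, b: PerceivedObject,
                       min_gap_m: float = 0.5) -> Optional[dict]:
    """Relation step: turn two perceived objects into a depth-ordering QA pair.

    Returns None when the depth gap is below min_gap_m, which stands in for
    automatic verification: ambiguous pairs are dropped rather than labeled.
    """
    if abs(a.depth_m - b.depth_m) < min_gap_m:
        return None
    closer = a if a.depth_m < b.depth_m else b
    return {
        "question": f"Which is closer to the camera: the {a.label} or the {b.label}?",
        "answer": closer.label,
    }


def build_samples(objects: List[PerceivedObject],
                  min_gap_m: float = 0.5) -> List[dict]:
    """Generate a QA pair for every object pair that passes the verification check."""
    samples = []
    for a, b in combinations(objects, 2):
        sample = depth_order_sample(a, b, min_gap_m)
        if sample is not None:
            samples.append(sample)
    return samples


# Made-up perception results for a single image.
objects = [
    PerceivedObject("chair", 2.0),
    PerceivedObject("lamp", 2.2),
    PerceivedObject("window", 5.0),
]
samples = build_samples(objects)
for sample in samples:
    print(sample["question"], "->", sample["answer"])
```

In this toy input, the chair/lamp pair is discarded because the two depth estimates are too close to order reliably, while the two pairs involving the window are kept. A real pipeline would presumably obtain depths from a monocular depth estimator and apply stricter checks.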

Key facts

SpatialForge is a scalable data synthesis pipeline for 3D spatial reasoning.
It transforms in-the-wild 2D images into spatial reasoning supervision.
Current VLMs struggle with spatial reasoning tasks like depth ordering and coordinate grounding.
Existing spatial supervision uses scene-centric datasets (multi-view scans, indoor video).
Scene-centric datasets are limited in scale and diversity compared to web-scale 2D images.
SpatialForge decomposes spatial reasoning into perception and relation.
It constructs structured supervision signals for depth, layout, and viewpoint-dependent reasoning.
The pipeline includes automatic verification.
The paper is on arXiv with ID 2605.11462.
The approach addresses the scarcity of diverse 3D training data.
