SpatialForge: AI Pipeline for 3D Spatial Reasoning from 2D Images
SpatialForge is a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision for large Vision-Language Models (VLMs). Current VLMs excel at semantic understanding but struggle with geometric tasks such as depth ordering and coordinate grounding. Existing spatial supervision relies on scene-centric datasets (multi-view scans, indoor video), which are limited in scale and diversity compared to web-scale 2D images. SpatialForge decomposes spatial reasoning into perception and relation, and constructs structured supervision signals for depth, layout, and viewpoint-dependent reasoning, with automatic verification. By drawing on abundant 2D imagery, the approach addresses the bottleneck of scarce 3D training data. The paper is available on arXiv (2605.11462).
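To make the perception/relation split concrete, here is a minimal sketch of how depth-ordering supervision could be derived from per-object depth estimates. It is not the paper's implementation: the `DetectedObject` and `QAPair` types, the `depth_order_pairs` function, the question template, and the `min_gap_m` margin (standing in for the automatic verification step) are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterator


@dataclass(frozen=True)
class DetectedObject:
    """Hypothetical perception-stage output: a label and an estimated depth."""
    label: str
    depth_m: float  # assumed: estimated distance from the camera, in metres


@dataclass(frozen=True)
class QAPair:
    """Hypothetical relation-stage output: one supervision sample."""
    question: str
    answer: str


def depth_order_pairs(
    objects: list[DetectedObject], min_gap_m: float = 0.5
) -> Iterator[QAPair]:
    """Yield depth-ordering questions for every unambiguous object pair.

    Pairs are dropped when both objects share a label (the question would be
    ambiguous) or when their depth gap is below min_gap_m (the ordering is
    too close to trust). The margin is an assumed stand-in for verification.
    """
    for a, b in combinations(objects, 2):
        if a.label == b.label:
            continue
        if abs(a.depth_m - b.depth_m) < min_gap_m:
            continue
        closer = a if a.depth_m < b.depth_m else b
        yield QAPair(
            question=(
                f"Which is closer to the camera, the {a.label} or the {b.label}?"
            ),
            answer=closer.label,
        )


if __name__ == "__main__":
    scene = [
        DetectedObject("chair", 1.2),
        DetectedObject("table", 1.4),
        DetectedObject("lamp", 3.0),
    ]
    for pair in depth_order_pairs(scene):
        print(pair.question, "->", pair.answer)
```

In this sketch the chair/table pair is filtered out because the two depths differ by less than the margin, so only pairs with a clear ordering become training samples. A real pipeline would presumably obtain the depths from a monocular depth model and apply stronger checks.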
Key facts
- SpatialForge is a scalable data synthesis pipeline for 3D spatial reasoning.
- It transforms in-the-wild 2D images into spatial reasoning supervision.
- Current VLMs struggle with spatial reasoning tasks like depth ordering and coordinate grounding.
- Existing spatial supervision relies on scene-centric datasets (multi-view scans, indoor video).
- Scene-centric datasets are limited in scale and diversity compared to web-scale 2D images.
- SpatialForge decomposes spatial reasoning into perception and relation.
- It constructs structured supervision signals for depth, layout, and viewpoint-dependent reasoning.
- The pipeline includes automatic verification.
- The paper is on arXiv with ID 2605.11462.
- The approach addresses the scarcity of diverse 3D training data.
Entities
Institutions
- arXiv