ARTFEED — Contemporary Art Intelligence

SpatialForge: AI Pipeline for 3D Spatial Reasoning from 2D Images

ai-technology · 2026-05-13

SpatialForge is a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision for large Vision-Language Models (VLMs). Current VLMs excel at semantic understanding but struggle with geometric tasks such as depth ordering and coordinate grounding. Existing spatial supervision relies on scene-centric datasets (multi-view scans, indoor video), which are limited in scale and diversity compared with web-scale 2D images. SpatialForge decomposes spatial reasoning into perception and relation, constructing structured supervision signals for depth, layout, and viewpoint-dependent reasoning with automatic verification. By drawing on abundant 2D imagery, the approach addresses the scarcity of diverse 3D training data. The paper is available on arXiv (2605.11462).
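To make the perception/relation split concrete, here is a minimal Python sketch of how a depth-ordering training sample might be synthesized from a single image. It is an illustration only, not the paper's implementation: the `ObjectPerception` record, the `depth_order_sample` and `synthesize` functions, and the `min_gap_ratio` threshold are all hypothetical names and values, and the per-object depths are assumed to come from some upstream monocular depth estimator.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class ObjectPerception:
    """Hypothetical output of a perception stage for one detected object."""
    label: str
    median_depth_m: float  # assumed estimate of distance from the camera


def depth_order_sample(a: ObjectPerception, b: ObjectPerception,
                       min_gap_ratio: float = 0.15) -> Optional[dict]:
    """Relation stage: build one depth-ordering QA pair, or None if rejected."""
    if a.label == b.label:
        return None  # the question would be ambiguous
    near, far = sorted((a, b), key=lambda o: o.median_depth_m)
    if far.median_depth_m <= 0:
        return None  # invalid depth estimate
    # Verification step (an assumed rule): keep the pair only when the
    # relative depth gap is large enough to trust the ordering.
    gap = (far.median_depth_m - near.median_depth_m) / far.median_depth_m
    if gap < min_gap_ratio:
        return None
    return {
        "question": f"Which is closer to the camera, the {a.label} or the {b.label}?",
        "answer": near.label,
    }


def synthesize(objects: list[ObjectPerception],
               min_gap_ratio: float = 0.15) -> list[dict]:
    """Generate verified depth-ordering samples for every object pair."""
    samples = []
    for a, b in combinations(objects, 2):
        sample = depth_order_sample(a, b, min_gap_ratio)
        if sample is not None:
            samples.append(sample)
    return samples


if __name__ == "__main__":
    scene = [
        ObjectPerception("chair", 1.2),
        ObjectPerception("table", 1.25),
        ObjectPerception("window", 4.0),
    ]
    for sample in synthesize(scene):
        print(sample["question"], "->", sample["answer"])
```

In this sketch the chair/table pair is dropped because the two estimated depths are too close to order reliably, while both pairs involving the window are kept. Filtering on a margin like this is one simple way an automatic check could keep noisy depth estimates from turning into wrong labels.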

Key facts

  • SpatialForge is a scalable data synthesis pipeline for 3D spatial reasoning.
  • It transforms in-the-wild 2D images into spatial reasoning supervision.
  • Current VLMs struggle with spatial reasoning tasks like depth ordering and coordinate grounding.
  • Existing spatial supervision uses scene-centric datasets (multi-view scans, indoor video).
  • Scene-centric datasets are limited in scale and diversity compared to web-scale 2D images.
  • SpatialForge decomposes spatial reasoning into perception and relation.
  • It constructs structured supervision signals for depth, layout, and viewpoint-dependent reasoning.
  • The pipeline includes automatic verification.
  • The paper is on arXiv with ID 2605.11462.
  • The approach addresses the scarcity of diverse 3D training data.

Entities

Institutions

  • arXiv
