ARTFEED — Contemporary Art Intelligence

Unlabeled Internet Videos Boost 3D Scene Understanding

ai-technology · 2026-04-27

Researchers show that unlabeled internet videos can be processed automatically into training data for 3D scene understanding models. The approach, detailed in arXiv:2604.01907, uses a data engine to curate web videos and annotate them for 3D object detection, instance segmentation, spatial visual question answering (VQA), and vision-language navigation (VLN). Models trained on the automatically generated data show strong zero-shot performance, which reduces reliance on scarce, expensive human-annotated 3D datasets. The study also identifies the factors and bottlenecks in automated data generation that affect learning efficiency.
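As a rough illustration of what a curate-then-annotate data engine looks like in code, here is a minimal Python sketch. It is not the paper's implementation: the `Clip` fields, the thresholds, the `curate`, `annotate`, and `run_data_engine` functions, and the placeholder annotators are all hypothetical names chosen for this example. In a real system, each placeholder would be replaced by a perception or language model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Clip:
    """Hypothetical web video clip with cheap, precomputed quality signals."""
    clip_id: str
    num_frames: int
    camera_motion: float  # assumed scale: 0 = static camera, 1 = strong motion
    annotations: Dict[str, list] = field(default_factory=dict)


def curate(clips: List[Clip], min_frames: int = 30, min_motion: float = 0.2) -> List[Clip]:
    """Keep clips that are long enough and show enough camera motion.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return [c for c in clips if c.num_frames >= min_frames and c.camera_motion >= min_motion]


def annotate(clips: List[Clip], annotators: Dict[str, Callable[[Clip], list]]) -> List[Clip]:
    """Run every task-specific annotator on every clip and store its output."""
    for clip in clips:
        for task, annotator in annotators.items():
            clip.annotations[task] = annotator(clip)
    return clips


# Placeholder annotators: each returns an empty list where a real engine
# would call a model (hypothetical stand-ins only).
PLACEHOLDER_ANNOTATORS: Dict[str, Callable[[Clip], list]] = {
    "3d_object_detection": lambda clip: [],
    "spatial_vqa": lambda clip: [],
}


def run_data_engine(clips: List[Clip], annotators: Dict[str, Callable[[Clip], list]]) -> List[Clip]:
    """Curate first, then annotate only the clips that survive curation."""
    return annotate(curate(clips), annotators)
```

Filtering before annotating keeps the costly labeling step away from clips that are unlikely to yield useful 3D supervision, which is one plausible reason curation quality would affect learning efficiency.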

Key facts

  • arXiv:2604.01907
  • unlabeled internet videos used for 3D scene understanding
  • data engine automatically generates training data
  • evaluated on 3D object detection, instance segmentation, spatial VQA, and vision-language navigation (VLN)
  • zero-shot performance demonstrated
  • reduces need for human-annotated 3D data
  • bottlenecks in automated data generation analyzed
  • low-level perception and high-level reasoning tasks covered

Entities

Institutions

  • arXiv

Sources