Unlabeled Internet Videos Boost 3D Scene Understanding

ai-technology · 2026-04-27

Researchers show that unlabeled internet videos can be processed automatically into training data for 3D scene understanding models. The approach, detailed in arXiv:2604.01907, uses a data engine to curate web videos and produce annotations for tasks ranging from low-level perception to high-level reasoning, including 3D object detection, instance segmentation, spatial visual question answering (VQA), and vision-language navigation (VLN). Models trained on this automatically generated data show strong zero-shot performance, which reduces reliance on scarce and expensive human-annotated 3D datasets. The study also analyzes the bottlenecks in automated data generation and identifies the factors that affect learning efficiency.

Key facts

arXiv:2604.01907
unlabeled internet videos used for 3D scene understanding
data engine automatically generates training data
evaluated on 3D object detection, instance segmentation, spatial VQA, and vision-language navigation (VLN)
zero-shot performance demonstrated
reduces need for human-annotated 3D data
bottlenecks in automated data generation analyzed
low-level perception and high-level reasoning tasks covered
