Unlabeled Internet Videos Boost 3D Scene Understanding
Researchers demonstrate that unlabeled internet videos can be automatically processed into training data for 3D scene understanding models. The approach, detailed in arXiv:2604.01907, uses a data engine to curate web videos and produce annotations for tasks ranging from low-level perception (3D object detection, instance segmentation) to high-level reasoning (spatial VQA, vision-language navigation). Models trained on this automatically generated data show strong zero-shot performance, reducing reliance on scarce and expensive human-annotated 3D datasets. The study also analyzes bottlenecks in automated data generation and identifies the factors that affect learning efficiency.
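To make the curate-then-annotate idea concrete, here is a minimal Python sketch of what such a data engine could look like. It is not the paper's implementation: the `Clip` fields, the filtering thresholds, and the `annotate` placeholder are all hypothetical, chosen only to illustrate the general flow of filtering raw clips and emitting one training record per task.

```python
from dataclasses import dataclass, field

# Task names taken from the summary above; everything else is a hypothetical illustration.
TASKS = ("3d_object_detection", "instance_segmentation", "spatial_vqa", "vln")


@dataclass
class Clip:
    clip_id: str
    num_frames: int
    camera_motion: float  # assumed score in [0, 1]; higher means more viewpoint change


@dataclass
class TrainingSample:
    clip_id: str
    task: str
    annotation: dict = field(default_factory=dict)


def curate(clips, min_frames=30, min_motion=0.2):
    """Keep clips that are long enough and move enough (assumed thresholds)."""
    return [
        c for c in clips
        if c.num_frames >= min_frames and c.camera_motion >= min_motion
    ]


def annotate(clip, task):
    """Placeholder labeler; a real engine would run perception models here."""
    return TrainingSample(clip.clip_id, task, {"source": "auto", "frames": clip.num_frames})


def run_engine(clips, tasks=TASKS):
    """Curate the clips, then emit one automatically labeled sample per task."""
    return [annotate(c, t) for c in curate(clips) for t in tasks]


clips = [
    Clip("video_a", 120, 0.6),   # passes both filters
    Clip("video_b", 10, 0.9),    # too short
    Clip("video_c", 300, 0.05),  # almost no camera motion
]
samples = run_engine(clips)
```

In this toy run only `video_a` survives curation, so `samples` holds one record for each of the four tasks. The point of the sketch is the separation of concerns: curation decides which videos are worth labeling, and annotation is a per-task step that can be swapped out independently.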
Key facts
- arXiv:2604.01907
- unlabeled internet videos used for 3D scene understanding
- data engine automatically generates training data
- evaluated on 3D object detection, instance segmentation, spatial VQA, vision-language navigation (VLN)
- zero-shot performance demonstrated
- reduces need for human-annotated 3D data
- bottlenecks in automated data generation analyzed
- low-level perception and high-level reasoning tasks covered
Entities
Institutions
- arXiv