EgoBabyVLM Benchmark Tests Cross-Modal Learning from Egocentric Video
A new research paper introduces EgoBabyVLM, a benchmark for evaluating vision-language models (VLMs) on naturalistic egocentric video data. The study finds that current VLMs trained on curated web data fail to generalize to the sparse, weakly aligned streams produced by wearable devices, embodied agents, and infant head-cams. The researchers trained VLMs on datasets with varying degrees of semantic alignment, including infant and adult egocentric videos, and evaluated them with a suite called Machine-DevBench. The suite automatically generates tests of lexical and grammatical competence from each model's training vocabulary across logarithmic scales. The work highlights the limitations of cross-modal learning in this regime and provides a standardized evaluation pipeline for it.
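The summary does not describe how Machine-DevBench builds its tests. As a purely illustrative sketch, assume that "logarithmic scales" refers to word-frequency bands spaced by powers of ten, and that a lexical test pairs each target word with a distractor of similar frequency. The function names `frequency_band` and `make_lexical_trials` and the trial format are hypothetical, not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def frequency_band(count, base=10):
    """Return floor(log_base(count)) using integer arithmetic only.

    Counts 1-9 fall in band 0, 10-99 in band 1, and so on (for base 10).
    """
    band = 0
    while count >= base:
        count //= base
        band += 1
    return band


def make_lexical_trials(tokens, seed=0):
    """Build two-alternative trials from a training vocabulary (assumed format).

    Each trial pairs a target word with a distractor drawn from the same
    frequency band, so that frequency alone cannot separate the two words.
    """
    counts = Counter(tokens)

    # Group words by frequency band; sorting keeps the output reproducible.
    bands = defaultdict(list)
    for word in sorted(counts):
        bands[frequency_band(counts[word])].append(word)

    rng = random.Random(seed)
    trials = []
    for band, words in sorted(bands.items()):
        if len(words) < 2:
            continue  # no same-band distractor is available
        for target in words:
            distractor = rng.choice([w for w in words if w != target])
            trials.append(
                {"band": band, "target": target, "distractor": distractor}
            )
    return trials


# Toy "training corpus": words repeated according to their frequency.
toy_tokens = (
    ["ball"] * 12 + ["cup"] * 15 + ["dog"] * 3 + ["cat"] * 2 + ["zebra"] * 100
)
toy_trials = make_lexical_trials(toy_tokens)
```

In this toy corpus, "zebra" is the only word in its band and so produces no trial; a real pipeline would need a policy for such cases, which the summary does not specify.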
Key facts
- Paper titled 'EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data'
- Published on arXiv with ID 2605.19130
- VLMs trained on curated web data fail to generalize to egocentric streams
- No fixed evaluation pipeline existed for this regime
- Datasets include infant and adult egocentric videos
- Machine-DevBench is the core evaluation suite
- Machine-DevBench tests are generated automatically from the model's training vocabulary
- Study addresses cross-modal learning from sparse, weakly-aligned data
Entities
Institutions
- arXiv