PhysBrain 1.0: Scaling Physical Commonsense from Human Video
The PhysBrain 1.0 technical report presents a method for integrating physical commonsense into vision-language-action (VLA) models by converting large-scale human egocentric video into structured supervision. Its data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then generates question-answer pairs from them to train PhysBrain vision-language models (VLMs). These physical priors are then transferred to VLA policies through a capability-preserving, language-sensitive adaptation. The report claims state-of-the-art results in multimodal QA and embodied control on ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, along with strong out-of-domain performance on SimplerEnv. The findings suggest that physical commonsense drawn from human interaction videos can improve robot comprehension and adaptability.
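To make the data-engine idea concrete, the following is a minimal sketch of how structured annotations for one video clip might be turned into question-answer pairs. It is an illustration only, not the report's implementation: the `ObjectAnnotation` and `ClipAnnotation` types, the `depth_m` field, the `make_qa_pairs` function, and the question templates are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectAnnotation:
    # Hypothetical schema: an object label plus an estimated camera distance.
    name: str
    depth_m: float


@dataclass
class ClipAnnotation:
    # Hypothetical schema: objects seen in the clip and the action performed.
    objects: List[ObjectAnnotation]
    action: str


def make_qa_pairs(clip: ClipAnnotation) -> List[Tuple[str, str]]:
    """Build (question, answer) pairs from one annotated clip.

    The templates are assumptions for illustration; the actual
    PhysBrain data engine may use different ones.
    """
    pairs: List[Tuple[str, str]] = []

    # Scene-element question: list every annotated object.
    names = ", ".join(obj.name for obj in clip.objects)
    pairs.append(("Which objects are visible in the scene?", names))

    # Action-execution question.
    pairs.append(("What action is the person performing?", clip.action))

    # Depth-aware relation questions: one per pair of objects at different depths.
    for i, first in enumerate(clip.objects):
        for second in clip.objects[i + 1:]:
            if first.depth_m == second.depth_m:
                continue  # no well-defined answer when depths are equal
            closer = first if first.depth_m < second.depth_m else second
            question = (
                f"Which is closer to the camera, "
                f"the {first.name} or the {second.name}?"
            )
            pairs.append((question, closer.name))

    return pairs


if __name__ == "__main__":
    example = ClipAnnotation(
        objects=[ObjectAnnotation("cup", 0.4), ObjectAnnotation("kettle", 0.9)],
        action="pouring water into the cup",
    )
    for question, answer in make_qa_pairs(example):
        print(f"Q: {question}\nA: {answer}")
```

Template-based generation is only one plausible design. It keeps each answer traceable to a specific annotation field, which makes the resulting supervision easy to audit.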
Key facts
- PhysBrain 1.0 uses human egocentric video to generate physical commonsense supervision.
- The data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations.
- Question-answer pairs are created from extracted data to train PhysBrain VLMs.
- Physical priors are transferred to VLA policies via capability-preserving, language-sensitive adaptation.
- Benchmarks include ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa.
- The report claims state-of-the-art results on these benchmarks.
- It also reports strong out-of-domain performance on SimplerEnv.
- Report published on arXiv with ID 2605.15298.
Entities
Institutions
- arXiv