PhysBrain 1.0: Scaling Physical Commonsense from Human Video

ai-technology · 2026-05-18

The technical report PhysBrain 1.0 describes a method for injecting physical commonsense into vision-language-action (VLA) models by converting large-scale human egocentric video into structured supervision. Its data engine annotates scene elements, spatial dynamics, action execution, and depth-aware relations, then generates question-answer pairs from those annotations to train the PhysBrain vision-language models (VLMs). The resulting physical priors are transferred to VLA policies through a capability-preserving, language-sensitive adaptation. The report claims state-of-the-art results in multimodal QA and embodied control on ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, along with strong out-of-domain performance on SimplerEnv. The authors conclude that physical commonsense learned from human interaction videos can improve robot comprehension and adaptability.
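To make the annotation-to-QA step concrete, here is a minimal Python sketch of how structured clip annotations could be turned into question-answer pairs. It is an illustration only, not the report's data engine: the `ClipAnnotation` class, its fields, the `make_qa_pairs` function, and the question templates are all hypothetical and are not taken from the report.

```python
from dataclasses import dataclass, field


@dataclass
class ClipAnnotation:
    """Hypothetical structured annotation for one egocentric video clip."""
    objects: list[str]  # scene elements visible in the clip
    action: str         # the action the person is performing
    # (subject, relation, object) triples, e.g. ("cup", "left of", "plate")
    spatial_relations: list[tuple[str, str, str]] = field(default_factory=list)
    # (nearer, farther) pairs, assumed to come from a depth estimate
    depth_order: list[tuple[str, str]] = field(default_factory=list)


def make_qa_pairs(ann: ClipAnnotation) -> list[dict[str, str]]:
    """Turn one annotation into templated question-answer pairs."""
    qa = [
        {
            "question": "Which objects are visible in the scene?",
            "answer": ", ".join(ann.objects),
        },
        {
            "question": "What action is the person performing?",
            "answer": ann.action,
        },
    ]
    # One spatial question per annotated relation triple.
    for subj, rel, obj in ann.spatial_relations:
        qa.append({
            "question": f"Where is the {subj} relative to the {obj}?",
            "answer": f"The {subj} is {rel} the {obj}.",
        })
    # One depth question per annotated (nearer, farther) pair.
    for near, far in ann.depth_order:
        qa.append({
            "question": f"Which is closer to the camera, the {near} or the {far}?",
            "answer": f"The {near}.",
        })
    return qa


example = ClipAnnotation(
    objects=["cup", "plate"],
    action="picking up the cup",
    spatial_relations=[("cup", "left of", "plate")],
    depth_order=[("cup", "plate")],
)
for pair in make_qa_pairs(example):
    print(pair["question"], "->", pair["answer"])
```

Fixed templates keep the sketch short. A real pipeline would presumably vary the phrasing and check answers against the video, but the report summary gives no detail on that.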

Key facts

PhysBrain 1.0 uses human egocentric video to generate physical commonsense supervision.
The data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations.
Question-answer pairs are created from extracted data to train PhysBrain VLMs.
Physical priors are transferred to VLA policies via capability-preserving, language-sensitive adaptation.
Benchmarks include ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa.
The report claims state-of-the-art results on all of these benchmarks.
It also reports strong out-of-domain performance on SimplerEnv.
The report is published on arXiv with ID 2605.15298.
