ARTFEED — Contemporary Art Intelligence

PhysBrain 1.0: Scaling Physical Commonsense from Human Video

ai-technology · 2026-05-18

The PhysBrain 1.0 technical report describes a method for building physical commonsense into vision-language-action (VLA) models by converting large-scale human egocentric video into structured supervision. Its data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then generates question-answer pairs from them to train the PhysBrain vision-language models (VLMs). These physical priors are then transferred to VLA policies through a capability-preserving, language-sensitive adaptation. The report claims state-of-the-art results in multimodal QA and embodied control on ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, along with strong out-of-domain performance on SimplerEnv. The findings suggest that physical commonsense learned from human interaction video can improve robot comprehension and adaptability.
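To make the data-engine idea concrete, here is a minimal sketch of how structured annotations for one video clip might be turned into question-answer pairs. It is an illustration only, not the report's implementation: the `SceneObject` and `ClipAnnotation` types, the `make_qa_pairs` function, the question templates, and the example values are all hypothetical, and the annotations are assumed to have already been extracted upstream.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SceneObject:
    """Hypothetical record for one detected object."""
    name: str
    depth_m: float  # assumed estimated distance from the camera, in meters


@dataclass
class ClipAnnotation:
    """Hypothetical structured annotation for one egocentric clip."""
    objects: List[SceneObject]
    relations: List[Tuple[str, str, str]]  # (subject, relation, object)
    action: str


def make_qa_pairs(ann: ClipAnnotation) -> List[Dict[str, str]]:
    """Fill simple, made-up templates with the annotation fields."""
    pairs: List[Dict[str, str]] = []

    # Scene elements: list what is visible.
    pairs.append({
        "question": "Which objects are visible in the scene?",
        "answer": ", ".join(obj.name for obj in ann.objects),
    })

    # Spatial relations: one question per annotated triple.
    for subject, relation, target in ann.relations:
        pairs.append({
            "question": f"Where is the {subject} relative to the {target}?",
            "answer": f"The {subject} is {relation} the {target}.",
        })

    # Action execution: describe what the person is doing.
    pairs.append({
        "question": "What action is the person performing?",
        "answer": ann.action,
    })

    # Depth-aware relation: only ask when there is a unique nearest object.
    by_depth = sorted(ann.objects, key=lambda obj: obj.depth_m)
    if len(by_depth) >= 2 and by_depth[0].depth_m < by_depth[1].depth_m:
        pairs.append({
            "question": "Which object is closest to the camera?",
            "answer": by_depth[0].name,
        })

    return pairs


if __name__ == "__main__":
    example = ClipAnnotation(
        objects=[SceneObject("cup", 0.4), SceneObject("kettle", 0.9)],
        relations=[("cup", "left of", "kettle")],
        action="pouring water into the cup",
    )
    for pair in make_qa_pairs(example):
        print(pair["question"], "->", pair["answer"])
```

Template filling is used here only because it is easy to follow; the summary does not say how the report's engine actually phrases its questions or answers.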

Key facts

  • PhysBrain 1.0 uses human egocentric video to generate physical commonsense supervision.
  • The data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations.
  • Question-answer pairs are created from extracted data to train PhysBrain VLMs.
  • Physical priors are transferred to VLA policies via capability-preserving, language-sensitive adaptation.
  • Benchmarks include ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa.
  • The report claims state-of-the-art results across these benchmarks.
  • It also reports strong out-of-domain performance on SimplerEnv.
  • Report published on arXiv with ID 2605.15298.

Entities

Institutions

  • arXiv

Sources