ARTFEED — Contemporary Art Intelligence

PhysMem Framework Enables Vision-Language Models to Learn Physical Principles Through Robot Interaction

ai-technology · 2026-04-22

A new memory framework called PhysMem allows vision-language model (VLM) planners to acquire knowledge of physical properties through direct robot interaction at test time, without updating model parameters. The system addresses a known limitation: VLMs can reason generally about concepts like friction and stability but struggle to predict specific outcomes, such as how a particular ball will roll on a given surface or which stone offers stable support, without firsthand experience.

PhysMem operates by recording interactions, generating candidate hypotheses, and verifying them through targeted testing before applying validated knowledge to future decisions. A key design principle is verification before application: hypotheses are tested against new observations rather than retrieved experience being applied directly, which reduces rigid dependence on prior experience when physical conditions change.

The framework is documented in the arXiv preprint 2602.20323v5, announced as a replace-cross submission. By learning physical principles from interaction, PhysMem aims to improve the adaptability and reliability of VLM-based robot planners in real-world manipulation scenarios where physical properties vary across objects and environments.
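The record → hypothesize → verify → apply cycle described above can be sketched as a small memory structure. This is a minimal illustration, not the paper's implementation: the class and method names (`PhysicalMemory`, `Hypothesis`, `verify`, `lookup`) and the tolerance-based check are all assumptions introduced here to show the verification-before-application idea.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    prop: str              # hypothetical property name, e.g. "rolling_friction"
    value: float           # candidate estimate from past interaction
    verified: bool = False # flips to True only after a targeted probe

class PhysicalMemory:
    """Sketch of a record -> hypothesize -> verify -> apply loop.

    No model parameters are updated; knowledge lives in this external memory.
    """

    def __init__(self):
        self.log = []          # raw interaction records
        self.hypotheses = []   # (object, Hypothesis) candidates

    def record(self, obj, action, outcome):
        # Step 1: store the raw interaction experience.
        self.log.append((obj, action, outcome))

    def hypothesize(self, obj, prop, value):
        # Step 2: propose a candidate physical-property estimate.
        h = Hypothesis(prop, value)
        self.hypotheses.append((obj, h))
        return h

    def verify(self, h, probe_outcome, tol=0.1):
        # Step 3: verification before application -- a hypothesis becomes
        # usable only after a fresh targeted probe confirms it.
        h.verified = abs(probe_outcome - h.value) <= tol
        return h.verified

    def lookup(self, obj, prop):
        # Step 4: only verified knowledge informs future decisions,
        # avoiding rigid reliance on stale prior experience.
        for o, h in self.hypotheses:
            if o == obj and h.prop == prop and h.verified:
                return h.value
        return None
```

For example, an unverified friction estimate for a ball is invisible to the planner until a probe within tolerance confirms it; if conditions have changed and the probe disagrees, the stale estimate is never applied.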

Key facts

  • PhysMem is a memory framework for vision-language model (VLM) robot planners.
  • It enables learning of physical principles from interaction at test time without updating model parameters.
  • The system records experiences, generates hypotheses, and verifies them through targeted interaction.
  • Verification before application reduces rigid reliance on prior experience when physical conditions change.
  • It addresses VLM limitations in predicting specific outcomes like how a ball rolls on a surface.
  • The goal is to improve reliability in object manipulation as physical properties vary across objects and environments.
  • The framework is detailed in arXiv preprint 2602.20323v5, announced as replace-cross.
