PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Models

other · 2026-04-29

PhysNote is a new framework for improving the physical reasoning of Vision-Language Models (VLMs) in dynamic real-world scenarios. VLMs handle textbook physics well but struggle to keep temporal consistency and reason causally across frames. The summary traces this to two problems: spatio-temporal identity drift and the volatility of inference-time insights. PhysNote addresses them by externalizing and refining physical knowledge as self-generated Knowledge Notes, stabilizing dynamic perception through spatio-temporal canonicalization, and organizing insights into a hierarchical knowledge repository that supports iterative improvement. The framework is described in a paper on arXiv (2604.24443).
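
The summary does not say how the hierarchical repository is implemented, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes that notes are keyed by slash-separated topic paths, that refining a note overwrites its text, and that retrieval returns a topic's note together with those of its ancestors. The names KnowledgeNote, NoteRepository, upsert, and retrieve are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeNote:
    """One self-generated note (hypothetical structure, not from the paper)."""
    topic: str          # assumed hierarchical key, e.g. "mechanics/collision"
    text: str
    revisions: int = 0  # how many times the note has been refined


@dataclass
class NoteRepository:
    """Toy hierarchical store keyed by slash-separated topic paths."""
    notes: dict[str, KnowledgeNote] = field(default_factory=dict)

    def upsert(self, topic: str, text: str) -> KnowledgeNote:
        """Create a note, or refine an existing one by replacing its text."""
        note = self.notes.get(topic)
        if note is None:
            note = KnowledgeNote(topic, text)
            self.notes[topic] = note
        else:
            note.text = text
            note.revisions += 1
        return note

    def retrieve(self, topic: str) -> list[KnowledgeNote]:
        """Return notes for the topic and its ancestors, most general first."""
        parts = topic.split("/")
        prefixes = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
        return [self.notes[p] for p in prefixes if p in self.notes]


repo = NoteRepository()
repo.upsert("mechanics", "Track each object under one stable ID across frames.")
repo.upsert("mechanics/collision", "Check momentum before and after contact.")
# A later pass refines the same note instead of discarding the insight.
repo.upsert("mechanics/collision", "Check momentum and contact timing per object ID.")

found = repo.retrieve("mechanics/collision/elastic")
```

In this sketch a query for a specific topic also picks up more general notes stored higher in the hierarchy, which is one plausible way a repository could keep inference-time insights from being lost between runs.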

Key facts

VLMs fail in dynamic real-world scenarios requiring temporal consistency and causal reasoning.
Two challenges: spatio-temporal identity drift and volatility of inference-time insights.
PhysNote uses self-generated Knowledge Notes to externalize and refine physical knowledge.
It stabilizes dynamic perception through spatio-temporal canonicalization.
Insights are organized into a hierarchical knowledge repository.
The repository supports iterative improvement of the model's physical reasoning.
Paper available on arXiv with ID 2604.24443.
