PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Models
PhysNote is a new framework for improving the physical reasoning of Vision-Language Models (VLMs) in dynamic real-world scenarios. VLMs handle textbook physics well but struggle to maintain temporal consistency and causal reasoning across frames, for two reasons: spatio-temporal identity drift and the volatility of inference-time insights. PhysNote addresses both by externalizing and refining physical knowledge as self-generated Knowledge Notes, stabilizing dynamic perception through spatio-temporal canonicalization, and organizing insights into a hierarchical repository that supports iterative improvement. The framework is described in a paper on arXiv (2604.24443).
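To make the idea of a hierarchical repository of refinable notes concrete, here is a minimal Python sketch. It is not the paper's implementation: the class names `KnowledgeNote` and `NoteRepository`, the tuple-based topic paths, and the revision counter are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeNote:
    """A self-generated note. Field names are hypothetical."""
    text: str
    revision: int = 0


@dataclass
class NoteRepository:
    """Hypothetical hierarchical store keyed by a topic path,
    e.g. ("mechanics", "friction")."""
    notes: dict = field(default_factory=dict)

    def write(self, path, text):
        """Create a note at `path`, or refine the existing one in place."""
        key = tuple(path)
        if key in self.notes:
            note = self.notes[key]
            note.text = text
            note.revision += 1
        else:
            self.notes[key] = KnowledgeNote(text)
        return self.notes[key]

    def retrieve(self, prefix):
        """Return (path, note) pairs whose path starts with `prefix`,
        ordered by path."""
        prefix = tuple(prefix)
        return [
            (path, self.notes[path])
            for path in sorted(self.notes)
            if path[: len(prefix)] == prefix
        ]


repo = NoteRepository()
repo.write(("mechanics", "friction"), "Sliding objects slow down.")
repo.write(("mechanics", "friction"), "Sliding objects decelerate on rough surfaces.")
repo.write(("optics",), "Shadows point away from the light source.")
mechanics_notes = repo.retrieve(("mechanics",))
```

Under these assumptions, writing to an existing path stands in for refinement, and prefix lookup stands in for retrieving all notes under one branch of the hierarchy.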
Key facts
- VLMs struggle in dynamic real-world scenarios that require temporal consistency and causal reasoning across frames.
- Two challenges are identified: spatio-temporal identity drift and the volatility of inference-time insights.
- PhysNote uses self-generated Knowledge Notes to externalize and refine physical knowledge.
- It stabilizes dynamic perception through spatio-temporal canonicalization.
- Insights are organized into a hierarchical knowledge repository.
- The hierarchical repository supports iterative improvement.
- The paper is available on arXiv (ID 2604.24443).
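The identity-drift problem listed above can be illustrated with a small sketch that keeps object identities stable from frame to frame. This is a generic greedy nearest-centroid matcher, not PhysNote's canonicalization procedure, which this summary does not specify. The function name `canonicalize_tracks`, the `(x, y)` centroid input, and the `max_dist` threshold are assumptions.

```python
import math


def canonicalize_tracks(frames, max_dist=50.0):
    """Assign a stable integer ID to each detection across frames.

    `frames` is a list of frames; each frame is a list of (x, y) centroids.
    Returns a parallel list of ID lists. Hypothetical helper for illustration.
    """
    next_id = 0
    previous = {}  # object ID -> centroid in the previous frame
    all_ids = []
    for detections in frames:
        unused = dict(previous)  # previous-frame objects not yet claimed
        current = {}
        frame_ids = []
        for point in detections:
            # Greedily pick the closest unclaimed object from the previous frame.
            best = min(
                unused,
                key=lambda i: math.dist(unused[i], point),
                default=None,
            )
            if best is not None and math.dist(unused[best], point) <= max_dist:
                obj_id = best
                del unused[best]
            else:
                # No close match: treat the detection as a new object.
                obj_id = next_id
                next_id += 1
            current[obj_id] = point
            frame_ids.append(obj_id)
        # Objects missing from this frame are dropped, not remembered.
        previous = current
        all_ids.append(frame_ids)
    return all_ids


frames = [
    [(0, 0), (100, 0)],
    [(102, 1), (3, 1)],      # same two objects, listed in swapped order
    [(5, 2), (300, 300)],    # one object persists, one new object appears
]
ids = canonicalize_tracks(frames)
```

In this toy input the second frame lists the objects in a different order, yet each keeps its original ID, which is the kind of consistency a downstream reasoning step would rely on.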
Entities
Institutions
- arXiv