ARTFEED — Contemporary Art Intelligence

PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Models

other · 2026-04-29

A new framework called PhysNote aims to improve the physical reasoning of Vision-Language Models (VLMs) in dynamic real-world scenarios. VLMs excel at textbook physics but struggle with temporal consistency and causal reasoning across frames. The paper attributes this to two challenges: spatio-temporal identity drift and the volatility of inference-time insights. PhysNote addresses both by externalizing and refining physical knowledge through self-generated Knowledge Notes, stabilizing dynamic perception via spatio-temporal canonicalization, and organizing insights into a hierarchical knowledge repository that supports iterative improvement. The framework is detailed in a paper on arXiv (2604.24443).

Key facts

  • VLMs excel at textbook physics but struggle in dynamic real-world scenarios that require temporal consistency and causal reasoning across frames.
  • Two challenges: spatio-temporal identity drift and volatility of inference-time insights.
  • PhysNote uses self-generated Knowledge Notes to externalize and refine physical knowledge.
  • It stabilizes dynamic perception through spatio-temporal canonicalization.
  • Insights are organized into a hierarchical knowledge repository.
  • The repository supports iterative improvement of the model's physical reasoning.
  • Paper available on arXiv with ID 2604.24443.
  • The arXiv announcement type is "new", meaning a new submission rather than a replacement or cross-listing.
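
To make the ideas in the list above more concrete, here is a minimal Python sketch of a note repository with stable object IDs. It is an illustration only and not the paper's implementation: the summary gives no API details, so every name here (KnowledgeNote, NoteRepository, canonical_ids), the two-level topic layout, and the smoothed scoring rule are assumptions made for the example. The canonical_ids helper is a toy stand-in for spatio-temporal canonicalization that simply gives each object label one ID across frames.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeNote:
    """Hypothetical note: one physical insight plus usage statistics."""
    text: str
    uses: int = 0
    successes: int = 0

    def score(self) -> float:
        # Laplace-smoothed success rate, so an unused note starts at 0.5.
        return (self.successes + 1) / (self.uses + 2)


@dataclass
class NoteRepository:
    """Assumed two-level hierarchy: topic -> list of notes."""
    topics: dict[str, list[KnowledgeNote]] = field(default_factory=dict)

    def add(self, topic: str, text: str) -> KnowledgeNote:
        # Externalize an insight so it persists beyond a single inference.
        note = KnowledgeNote(text)
        self.topics.setdefault(topic, []).append(note)
        return note

    def retrieve(self, topic: str, k: int = 1) -> list[KnowledgeNote]:
        # Return the k highest-scoring notes for a topic (empty if unknown).
        notes = self.topics.get(topic, [])
        return sorted(notes, key=lambda n: n.score(), reverse=True)[:k]

    def feedback(self, note: KnowledgeNote, helped: bool) -> None:
        # Refine the repository: record whether the note led to a good answer.
        note.uses += 1
        note.successes += int(helped)


def canonical_ids(frames: list[list[str]]) -> dict[str, int]:
    """Toy canonicalization: one stable ID per object label,
    assigned in order of first appearance across frames."""
    ids: dict[str, int] = {}
    for frame in frames:
        for label in frame:
            ids.setdefault(label, len(ids))
    return ids
```

In this sketch, a caller would add notes under a topic such as "collision", call retrieve before answering a question, and call feedback afterwards, so that notes which helped rank higher on later queries. That loop is one simple way to read "iterative improvement"; the paper may organize and refine its notes quite differently.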

Entities

Institutions

  • arXiv

Sources