TC-WM: Compact World Models from Visual Foundation Models

ai-technology · 2026-05-26

TC-WM is a framework for deriving compact world models from visual foundation models for planning and control. World models let agents predict future dynamics conditioned on their actions, but existing representations are either learned from pixels, and so lack semantic depth, or inherited from frozen foundation models, and so carry unnecessary detail. This is a particular challenge in reward-free offline settings, where the model learns from pre-collected trajectories with neither reward signals nor online interaction. TC-WM treats the pretrained embedding space as a semantic scaffold and linearly projects the high-dimensional visual embeddings into a compact latent space for modeling dynamics. The goal is a task-sufficient state representation for downstream planning and control. The paper is available on arXiv as 2605.25620.
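
To make the projection step concrete, here is a minimal NumPy sketch of mapping foundation-model embeddings into a compact latent state and predicting the next latent from an action. It is not the paper's implementation. The dimensions, the parameter names (`W_proj`, `A`, `B`), the `tanh` transition, and the random initialisation are all assumptions made for illustration, and the random `frame_embeddings` only stand in for the output of a frozen visual encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 768   # assumed size of the frozen foundation-model embedding
LATENT_DIM = 32   # assumed size of the compact latent state
ACTION_DIM = 4    # assumed action dimensionality

# Hypothetical parameters; in practice these would be learned from offline
# trajectories. They are randomly initialised here for illustration only.
W_proj = rng.normal(scale=EMBED_DIM ** -0.5, size=(EMBED_DIM, LATENT_DIM))
A = rng.normal(scale=LATENT_DIM ** -0.5, size=(LATENT_DIM, LATENT_DIM))
B = rng.normal(scale=ACTION_DIM ** -0.5, size=(ACTION_DIM, LATENT_DIM))


def project(embeddings):
    """Linearly project embeddings (..., EMBED_DIM) to latents (..., LATENT_DIM)."""
    return embeddings @ W_proj


def predict_next(latent, action):
    """Toy action-conditioned transition (an assumed form, not the paper's)."""
    return np.tanh(latent @ A + action @ B)


# Stand-in for embeddings of 5 frames produced by a frozen visual encoder.
frame_embeddings = rng.normal(size=(5, EMBED_DIM))
actions = rng.normal(size=(5, ACTION_DIM))

z = project(frame_embeddings)       # compact latent states
z_next = predict_next(z, actions)   # predicted next latent states
print(z.shape, z_next.shape)
```

The only part taken from the source is that the projection is linear. Everything downstream of `project` is a placeholder for whatever dynamics model the paper actually trains.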

Key facts

TC-WM is a framework for turning foundation-model embeddings into compact, task-sufficient world representations.
It addresses the challenge of learning world models in reward-free offline settings.
The key design is to treat the pretrained embedding space as a semantic scaffold.
TC-WM linearly projects high-dimensional visual embeddings into a compact latent space.
The paper is available on arXiv with identifier 2605.25620.
World models enable agents to predict future dynamics conditioned on actions.
Existing representations are either learned from pixels or inherited from frozen foundation models.
The approach aims to improve planning and control in downstream tasks.
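
As a rough illustration of how a compact latent model could support planning without rewards, the sketch below rolls candidate action sequences forward in latent space and picks the one whose final state lands closest to a goal latent. The summary does not say which planner TC-WM uses. Random shooting is swapped in here purely as a simple example, and the dynamics function `step`, the matrices `A` and `B`, and all sizes are hypothetical.

```python
import numpy as np

LATENT_DIM, ACTION_DIM, HORIZON, NUM_CANDIDATES = 8, 2, 5, 256


def step(latent, action, A, B):
    """Hypothetical latent dynamics: one action-conditioned prediction step."""
    return np.tanh(latent @ A + action @ B)


def plan_first_action(z0, z_goal, A, B, rng):
    """Random-shooting planner (an assumed choice, not the paper's method).

    Samples action sequences, rolls each out in latent space, and returns the
    first action of the sequence whose final latent is closest to z_goal.
    """
    candidates = rng.uniform(-1.0, 1.0, size=(NUM_CANDIDATES, HORIZON, ACTION_DIM))
    z = np.repeat(z0[None, :], NUM_CANDIDATES, axis=0)
    for t in range(HORIZON):
        z = step(z, candidates[:, t, :], A, B)
    # Reward-free objective: distance between the final latent and the goal.
    costs = np.linalg.norm(z - z_goal, axis=1)
    best = int(np.argmin(costs))
    return candidates[best, 0], float(costs[best])


rng = np.random.default_rng(0)
A = rng.normal(scale=0.3, size=(LATENT_DIM, LATENT_DIM))   # placeholder weights
B = rng.normal(scale=0.3, size=(ACTION_DIM, LATENT_DIM))   # placeholder weights
z0 = rng.normal(size=LATENT_DIM)                # current latent state
z_goal = np.tanh(rng.normal(size=LATENT_DIM))   # goal latent, e.g. from a goal image

action, cost = plan_first_action(z0, z_goal, A, B, rng)
print(action.shape, cost >= 0.0)
```

Because each rollout runs in the compact latent space instead of the full embedding space, evaluating many candidate sequences stays cheap, which is one practical motivation for a small latent state.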
