Residual Latent Action Enables Efficient Visual Feature-Based World Models
A new arXiv paper introduces Residual Latent Action (RLA), a latent action representation learned from DINO residuals, and proposes RLA-WM, a world model that predicts RLA via flow matching. Existing feature-based world models rely on direct regression, which produces blurry or collapsed predictions in complex interactions; RLA-WM avoids this and outperforms them, addressing the challenge of generative modeling in high-dimensional feature spaces. The paper reports that RLA is predictive, generalizable, and encodes temporal progression, and presents it as a more efficient and less hallucination-prone alternative to image-generation-based world models.
Key facts
- Residual Latent Action (RLA) is a new type of latent action representation.
- RLA is learned from DINO residuals.
- RLA-WM predicts RLA values via flow matching.
- RLA-WM outperforms state-of-the-art feature-based world models.
- Existing feature-based approaches rely on direct regression, leading to blurry or collapsed predictions.
- Generative modeling in high-dimensional feature spaces remains challenging.
- RLA is predictive, generalizable, and encodes temporal progression.
- The paper is available on arXiv under ID 2605.07079.
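
The two mechanisms named in the key facts, feature residuals and flow matching, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the summary does not describe how RLA is encoded from DINO residuals, so the sketch simply treats the raw residual between consecutive frame features as the prediction target, uses random arrays in place of real DINO features, and uses the common linear-interpolation form of flow matching. All function names here (`feature_residual`, `flow_matching_pair`, `flow_matching_loss`, `euler_sample`) are hypothetical.

```python
import numpy as np


def feature_residual(feat_t, feat_next):
    """Difference between consecutive frame features.

    Assumption: stands in for a 'DINO residual'; the real RLA is a learned
    representation whose encoder is not described in this summary.
    """
    return feat_next - feat_t


def flow_matching_pair(x1, rng):
    """Build one training example for linear-path flow matching.

    x1 is the data sample (here, a residual). Returns the interpolated point
    x_t, the time t, and the regression target for the velocity network.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()                    # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight path
    v_target = x1 - x0                   # constant velocity along that path
    return x_t, t, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))


def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat_t = rng.standard_normal((4, 8))                     # placeholder features
    feat_next = feat_t + 0.1 * rng.standard_normal((4, 8))
    residual = feature_residual(feat_t, feat_next)

    x_t, t, v_target = flow_matching_pair(residual, rng)
    # A trained network would supply v_pred; zeros are a placeholder here.
    print("loss with zero prediction:", flow_matching_loss(np.zeros_like(v_target), v_target))
```

In a real system a neural network conditioned on the current features would replace the placeholder prediction, and `euler_sample` would be called with that network to generate a residual from noise at inference time. The contrast with direct regression is that the model learns a velocity field that transports noise to plausible targets rather than predicting a single averaged outcome.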
Entities
Institutions
- arXiv