Residual Latent Action Enables Efficient Visual Feature-Based World Models

ai-technology · 2026-05-11

A new arXiv paper introduces Residual Latent Action (RLA), a latent action representation learned from DINO residuals, and proposes RLA-WM, a world model that predicts RLA via flow matching. Existing feature-based world models rely on direct regression, which produces blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces remains challenging. RLA-WM outperforms those feature-based models by avoiding both failure modes. The work demonstrates that RLA is predictive, generalizable, and encodes temporal progression, offering a more efficient and less hallucination-prone alternative to image-generation-based world models.

Key facts

Residual Latent Action (RLA) is a new type of latent action representation.
RLA is learned from DINO residuals.
RLA-WM predicts RLA values via flow matching.
RLA-WM outperforms state-of-the-art feature-based world models.
Existing feature-based approaches rely on direct regression, leading to blurry or collapsed predictions.
Generative modeling in high-dimensional feature spaces remains challenging.
RLA is predictive, generalizable, and encodes temporal progression.
The paper is published on arXiv with ID 2605.07079.
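
To make two of the ideas above concrete (a residual between frame features, and prediction via flow matching), here is a minimal NumPy sketch. It is not the paper's implementation. The random arrays stand in for frozen DINO patch features, the array shapes are arbitrary, the residual is used directly in place of a learned latent action, and the helper names `flow_matching_pair` and `euler_sample` are hypothetical. The flow matching variant shown is the common straight-line path between noise and data, which the paper may or may not use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen DINO patch features of two consecutive frames
# (assumed shape: 16 patches x 8 channels; real features are much larger).
feat_t = rng.normal(size=(16, 8))
feat_next = feat_t + 0.1 * rng.normal(size=(16, 8))

# Residual between consecutive feature maps: the kind of signal a
# residual-based latent action would be learned from.
residual = feat_next - feat_t


def flow_matching_pair(x1, rng):
    """Build one training example for flow matching on a straight-line path."""
    x0 = rng.normal(size=x1.shape)   # noise sample at t = 0
    t = rng.uniform()                # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1    # point on the path from noise to data
    target_velocity = x1 - x0        # what a velocity network would regress
    return x_t, t, target_velocity, x0


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


x_t, t, v_target, x0 = flow_matching_pair(residual, rng)

# With the exact (oracle) velocity in place of a trained network,
# integration carries the noise sample onto the residual.
recovered = euler_sample(lambda x, s: residual - x0, x0)
```

In a trained model the oracle velocity would be replaced by a network conditioned on the current features. Sampling from noise rather than regressing a single output is one way a model can avoid averaging over several plausible futures, the blurring that the summary attributes to direct regression.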
