DiLA: Disentangled Latent Action World Models for Video Generation

ai-technology · 2026-05-18

Researchers have introduced DiLA, a Disentangled Latent Action world model that addresses the trade-off between action abstraction and generation fidelity in Latent Action Models (LAMs). DiLA disentangles content from structure: the predictive bottleneck in latent action learning is used to separate spatial layout (structure) from visual detail (content). This separation yields continuous, semantically structured latent actions without two-stage training or optical flow constraints. The paper is available on arXiv under ID 2605.15725.

Key facts

DiLA stands for Disentangled Latent Action world model.
It addresses the trade-off between action abstraction and generation fidelity in LAMs.
The method uses content-structure disentanglement.
The predictive bottleneck in latent action learning drives the disentanglement.
The model routes spatial layouts into a structure pathway and visual details into a content pathway.
No two-stage training or optical flow constraints are needed.
The paper is available on arXiv with ID 2605.15725.
The approach yields continuous, semantically structured latent actions.
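
The content-structure split described above can be pictured with a toy example. This is a minimal sketch and not the paper's implementation: the frames are 1-D lists holding a single object, and the hand-written functions `encode`, `infer_latent_action`, and `predict_next` are hypothetical stand-ins for what would be learned networks in a real world model. It only illustrates the data flow: the latent action is computed from structure alone, and content is carried over unchanged.

```python
from typing import List, Tuple

Frame = List[float]  # toy 1-D "frame" of pixel intensities (assumption, not from the paper)


def encode(frame: Frame) -> Tuple[float, List[float]]:
    """Split a frame into (structure, content).

    structure: where the object is (center index of its non-zero pixels)
    content:   what it looks like (the object's intensity patch)
    Assumes exactly one contiguous object with positive intensities.
    """
    idx = [i for i, v in enumerate(frame) if v > 0]
    structure = (idx[0] + idx[-1]) / 2
    content = frame[idx[0]:idx[-1] + 1]
    return structure, content


def infer_latent_action(structure_t: float, structure_next: float) -> float:
    """Toy bottleneck: the action sees only structure, so it cannot encode appearance."""
    return structure_next - structure_t


def predict_next(structure: float, content: List[float], action: float, size: int) -> Frame:
    """Move the structure by the action and re-render the unchanged content there."""
    center = structure + action
    start = round(center - (len(content) - 1) / 2)
    out = [0.0] * size
    for k, v in enumerate(content):
        if 0 <= start + k < size:
            out[start + k] = v
    return out


# Infer an action from one clip, then reuse it on a frame with different content.
frame_t = [0.0, 0.8, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
frame_next = [0.0, 0.0, 0.0, 0.8, 0.5, 0.0, 0.0, 0.0]

s_t, c_t = encode(frame_t)
s_next, _ = encode(frame_next)
action = infer_latent_action(s_t, s_next)
reconstructed = predict_next(s_t, c_t, action, len(frame_t))

other = [0.0, 0.0, 0.3, 0.3, 0.3, 0.0, 0.0, 0.0]
s_o, c_o = encode(other)
transferred = predict_next(s_o, c_o, action, len(other))
```

In this toy setup the same continuous action reproduces the observed next frame and also shifts a differently shaped object by the same amount, which is the kind of appearance-independent behavior the article attributes to disentangled latent actions.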
