X-WAM: Unified 4D World Model for Robotics and Video Synthesis
Researchers have introduced X-WAM, a unified 4D world model that combines real-time robotic action execution with high-fidelity 4D world synthesis, spanning video generation and 3D reconstruction, within a single framework. The model addresses a key shortcoming of prior unified world models such as UWM, which operate only in 2D pixel space and struggle to balance action efficiency against world-modeling quality. X-WAM builds on pretrained video diffusion models to predict multi-view RGB-D videos, obtaining spatial information through a lightweight structural adaptation: the last few blocks of the pretrained Diffusion Transformer are replicated into a dedicated depth prediction branch. A technique called Asynchronous Noise Sampling (ANS) jointly improves generation quality and action decoding efficiency via a tailored asynchronous denoising schedule. The paper is available on arXiv under ID 2604.26694.
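The "lightweight structural adaptation" described above can be pictured as copying the final blocks of a pretrained Diffusion Transformer into a parallel branch that is then fine-tuned for depth. The sketch below is a minimal, framework-free illustration of that copying step; the class names, block count, and `add_depth_branch` helper are all hypothetical, not from the paper.

```python
import copy

class Block:
    """Toy stand-in for one Diffusion Transformer block (hypothetical)."""
    def __init__(self, idx):
        self.idx = idx  # position of the block in the pretrained stack

    def __call__(self, x):
        return x + 1  # placeholder for the block's real computation

class DiT:
    """Toy pretrained Diffusion Transformer: a stack of blocks."""
    def __init__(self, n_blocks=12):
        self.blocks = [Block(i) for i in range(n_blocks)]

def add_depth_branch(model, n_copied=2):
    """Replicate the last n_copied blocks into a separate depth branch.

    The original RGB pathway is left untouched; the copies are deep
    copies, so the depth branch can be fine-tuned independently.
    """
    return [copy.deepcopy(b) for b in model.blocks[-n_copied:]]

model = DiT()
branch = add_depth_branch(model, n_copied=2)
print(len(branch))                     # → 2
print(branch[0] is model.blocks[-2])   # → False (independent copy)
```

Because the branch reuses pretrained weights rather than adding fresh ones, it inherits the visual priors of the base model while keeping the extra parameter count small, which is presumably why the paper calls the adaptation "lightweight".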
Key facts
- X-WAM unifies real-time robotic action execution and high-fidelity 4D world synthesis.
- It addresses limitations of prior unified world models like UWM.
- X-WAM predicts multi-view RGB-D videos using pretrained video diffusion models.
- Spatial information is obtained via a lightweight structural adaptation: replicating final blocks of Diffusion Transformer into a depth prediction branch.
- Asynchronous Noise Sampling (ANS) jointly optimizes generation quality and action decoding efficiency.
- The paper is available on arXiv with ID 2604.26694.
- X-WAM stands for Unified 4D World Model.
- The approach leverages strong visual priors of pretrained video diffusion models.
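The key facts above mention Asynchronous Noise Sampling but not its mechanics. One plausible reading, sketched below under assumption, is that the video and action streams are assigned independent diffusion noise levels rather than a single shared timestep, which would let actions be denoised on a much shorter schedule than the video frames. The function name and timestep range here are illustrative, not taken from the paper.

```python
import random

def sample_asynchronous_timesteps(rng, t_max=1000):
    """Sample independent diffusion timesteps for the video and the
    action stream (one plausible reading of ANS; the exact sampling
    rule in X-WAM is not specified in the summary above).

    Decoupling the two noise levels means the action head can be
    trained and decoded on a coarse, fast schedule while the video
    branch keeps a fine-grained schedule for quality.
    """
    t_video = rng.randrange(t_max)
    t_action = rng.randrange(t_max)
    return t_video, t_action

rng = random.Random(0)
pairs = [sample_asynchronous_timesteps(rng) for _ in range(1000)]
decoupled = sum(1 for tv, ta in pairs if tv != ta)
print(decoupled)  # nearly all sampled pairs have distinct noise levels
```

Under this reading, "jointly optimizes generation quality and action decoding efficiency" would follow naturally: the asynchronous schedule lets each modality run at the step count that suits it.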