X-WAM: Unified 4D World Model for Robotics and Video Synthesis
Researchers have introduced X-WAM, a unified 4D world model that combines real-time robotic action execution with high-fidelity 4D world synthesis, spanning video generation and 3D reconstruction, within a single framework. The model addresses a key shortcoming of prior unified world models such as UWM, which operate only in 2D pixel space and struggle to balance action efficiency against world-modeling quality. X-WAM builds on pretrained video diffusion models to predict multi-view RGB-D videos, obtaining spatial information through a lightweight structural adaptation: the last few blocks of the pretrained Diffusion Transformer are replicated into a dedicated depth prediction branch. A technique called Asynchronous Noise Sampling (ANS) jointly improves generation quality and action decoding efficiency via a tailored asynchronous denoising schedule. The paper is available on arXiv under ID 2604.26694.
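The "lightweight structural adaptation" described above can be pictured as copying the final blocks of a pretrained Diffusion Transformer into a parallel branch that is then fine-tuned for depth. The sketch below is a minimal, framework-free illustration of that copying step; the class names, block count, and `add_depth_branch` helper are all hypothetical, not from the paper.

```python
import copy

class Block:
    """Toy stand-in for one Diffusion Transformer block (hypothetical)."""
    def __init__(self, idx):
        self.idx = idx  # position of the block in the pretrained stack

    def __call__(self, x):
        return x + 1  # placeholder for the block's real computation

class DiT:
    """Toy pretrained Diffusion Transformer: a stack of blocks."""
    def __init__(self, n_blocks=12):
        self.blocks = [Block(i) for i in range(n_blocks)]

def add_depth_branch(model, n_copied=2):
    """Replicate the last n_copied blocks into a separate depth branch.

    The original RGB pathway is left untouched; the copies are deep
    copies, so the depth branch can be fine-tuned independently.
    """
    return [copy.deepcopy(b) for b in model.blocks[-n_copied:]]

model = DiT()
branch = add_depth_branch(model, n_copied=2)
print(len(branch))                     # → 2
print(branch[0] is model.blocks[-2])   # → False (independent copy)
```

Because the branch reuses pretrained weights rather than adding fresh ones, it inherits the visual priors of the base model while keeping the extra parameter count small, which is presumably why the paper calls the adaptation "lightweight".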
Key facts
- X-WAM unifies real-time robotic action execution and high-fidelity 4D world synthesis.
- It addresses limitations of prior unified world models like UWM.
- X-WAM predicts multi-view RGB-D videos using pretrained video diffusion models.
- Spatial information is obtained via a lightweight structural adaptation: replicating final blocks of Diffusion Transformer into a depth prediction branch.
- Asynchronous Noise Sampling (ANS) jointly optimizes generation quality and action decoding efficiency.
- The paper is available on arXiv with ID 2604.26694.
- X-WAM stands for Unified 4D World Model.
- The approach leverages strong visual priors of pretrained video diffusion models.
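The key facts above mention Asynchronous Noise Sampling but not its mechanics. One plausible reading, sketched below under assumption, is that the video and action streams are assigned independent diffusion noise levels rather than a single shared timestep, which would let actions be denoised on a much shorter schedule than the video frames. The function name and timestep range here are illustrative, not taken from the paper.

```python
import random

def sample_asynchronous_timesteps(rng, t_max=1000):
    """Sample independent diffusion timesteps for the video and the
    action stream (one plausible reading of ANS; the exact sampling
    rule in X-WAM is not specified in the summary above).

    Decoupling the two noise levels means the action head can be
    trained and decoded on a coarse, fast schedule while the video
    branch keeps a fine-grained schedule for quality.
    """
    t_video = rng.randrange(t_max)
    t_action = rng.randrange(t_max)
    return t_video, t_action

rng = random.Random(0)
pairs = [sample_asynchronous_timesteps(rng) for _ in range(1000)]
decoupled = sum(1 for tv, ta in pairs if tv != ta)
print(decoupled)  # nearly all sampled pairs have distinct noise levels
```

Under this reading, "jointly optimizes generation quality and action decoding efficiency" would follow naturally: the asynchronous schedule lets each modality run at the step count that suits it.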