Video Models as Generalist Robot Policies via Inverse Dynamics

ai-technology · 2026-05-28

A new arXiv paper (2605.27817) proposes using video generative models as generalist robot policies without finetuning them. Instead of training a robot foundation model that jointly predicts observations and actions, the method leaves the video planner unchanged and trains an embodiment-specific inverse dynamics model (IDM), which infers the actions that carry the robot from one observation to the next. This decoupling keeps the video planner embodiment-agnostic, lets different video models be swapped in without retraining the IDM, and allows the IDM to be trained independently on self-play data. The system combines an action-free video world model with a carefully designed IDM based on the robot embodiment Jacobian, forming a closed-loop video-to-action policy.
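The closed loop described above can be illustrated with a toy sketch. This summary does not describe the paper's actual IDM design, so everything here is an assumption made for illustration: a planar two-link arm, a stub function standing in for the video model (it "predicts" the next end-effector position rather than a video frame), and a damped least-squares Jacobian inverse playing the role of the IDM. All function names are hypothetical.

```python
import math

L1, L2 = 1.0, 1.0  # link lengths of a hypothetical planar two-link arm


def forward_kinematics(q):
    """End-effector (x, y) for joint angles q; stands in for the observation."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return (x, y)


def jacobian(q):
    """2x2 Jacobian of the end-effector position with respect to the joints."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return ((-L1 * s1 - L2 * s12, -L2 * s12),
            (L1 * c1 + L2 * c12, L2 * c12))


def stub_video_planner(obs, goal, step=0.05):
    """Placeholder for an action-free video model (assumption, not a real model).

    Returns the next desired observation, a short move toward the goal.
    It never sees joint angles or actions, so it is embodiment-agnostic.
    """
    dx, dy = goal[0] - obs[0], goal[1] - obs[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        return goal
    return (obs[0] + step * dx / dist, obs[1] + step * dy / dist)


def jacobian_idm(q, obs, next_obs, damping=1e-3):
    """Toy IDM: joint delta via damped least squares, dq = J^T (J J^T + d^2 I)^-1 dx."""
    J = jacobian(q)
    dx = (next_obs[0] - obs[0], next_obs[1] - obs[1])
    # Entries of the symmetric matrix A = J J^T + damping^2 * I.
    a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
    det = a * d - b * b
    # Solve A v = dx, then map back to joint space with J^T.
    v0 = (d * dx[0] - b * dx[1]) / det
    v1 = (-b * dx[0] + a * dx[1]) / det
    return (J[0][0] * v0 + J[1][0] * v1, J[0][1] * v0 + J[1][1] * v1)


def run_policy(q, goal, planner=stub_video_planner, max_steps=200, tol=1e-3):
    """Closed loop: observe, plan the next observation, infer the action, act."""
    for _ in range(max_steps):
        obs = forward_kinematics(q)
        if math.hypot(goal[0] - obs[0], goal[1] - obs[1]) < tol:
            break
        next_obs = planner(obs, goal)
        dq = jacobian_idm(q, obs, next_obs)
        q = (q[0] + dq[0], q[1] + dq[1])
    return q


q_final = run_policy((0.3, 1.2), goal=(1.0, 1.0))
```

Because `run_policy` takes the planner as an argument, a different planner can be passed in without touching `jacobian_idm`, which mirrors the interchangeability claim in a very reduced form.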

Key facts

arXiv paper 2605.27817 proposes using video generative models as robot policies.
The approach leaves the video planner unchanged and trains an inverse dynamics model (IDM).
The IDM is embodiment-specific and based on the robot embodiment Jacobian.
The video planner remains embodiment-agnostic.
Different video models can be interchanged without retraining the IDM.
The IDM can be trained independently using self-play data.
The system forms a closed-loop video-to-action policy.
The method avoids finetuning video models with action-labeled data.
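The self-play point can also be sketched in a few lines. The paper's actual data collection and IDM training procedure are not given in this summary, so the following is only an assumed toy setup: a simulated two-link arm tries small random joint moves around one configuration, records the resulting observation change, and fits a local linear map from observation change to action by least squares. No video model and no task labels are involved. All names are hypothetical.

```python
import math
import random

L1, L2 = 1.0, 1.0  # link lengths of a hypothetical planar two-link arm


def forward_kinematics(q):
    """End-effector (x, y) for joint angles q; stands in for the observation."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return (x, y)


def collect_self_play(q0, n=500, scale=0.01, seed=0):
    """Apply random small joint moves around q0 and log (dx, dq) pairs."""
    rng = random.Random(seed)
    base = forward_kinematics(q0)
    data = []
    for _ in range(n):
        dq = (rng.uniform(-scale, scale), rng.uniform(-scale, scale))
        obs = forward_kinematics((q0[0] + dq[0], q0[1] + dq[1]))
        dx = (obs[0] - base[0], obs[1] - base[1])
        data.append((dx, dq))
    return data


def fit_linear_idm(data):
    """Least-squares fit of a 2x2 matrix M such that dq is approximately M dx."""
    sxx = sxy = syy = 0.0                 # entries of S = sum(dx dx^T)
    c = [[0.0, 0.0], [0.0, 0.0]]          # C = sum(dq dx^T)
    for dx, dq in data:
        sxx += dx[0] * dx[0]
        sxy += dx[0] * dx[1]
        syy += dx[1] * dx[1]
        for i in range(2):
            for j in range(2):
                c[i][j] += dq[i] * dx[j]
    det = sxx * syy - sxy * sxy
    s_inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    # Normal equations: M = C S^-1.
    return [[c[i][0] * s_inv[0][j] + c[i][1] * s_inv[1][j] for j in range(2)]
            for i in range(2)]


def predict_action(M, dx):
    """Apply the fitted IDM to a desired observation change."""
    return (M[0][0] * dx[0] + M[0][1] * dx[1],
            M[1][0] * dx[0] + M[1][1] * dx[1])


q0 = (0.3, 1.2)
M = fit_linear_idm(collect_self_play(q0))
```

The fitted matrix is only valid near `q0`. A usable IDM would need to cover the whole workspace, which is where a configuration-dependent model such as a Jacobian-based one would come in.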

Sources

arXiv cs.AI — 2026-05-28