RoboAlign-R1: Reward-Aligned Post-Training for Robot Video World Models
RoboAlign-R1 is a framework that addresses misalignment in robot video world models by combining reward-aligned post-training with stabilized long-horizon inference. Existing models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which correlate poorly with the capabilities a robot actually needs: instruction following, manipulation success, and physical plausibility. They also accumulate errors during long-horizon autoregressive prediction. To address both problems, RoboAlign-R1 introduces RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs drawn from four robot datasets, and RoboAlign-Judge, a multimodal teacher judge that provides fine-grained evaluation along six dimensions. The teacher is distilled into a compact student reward model that drives efficient reinforcement-learning-based post-training, reducing long-horizon rollout drift and improving alignment with task-level outcomes.
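To make the distillation step concrete, here is a minimal PyTorch sketch that regresses a lightweight student reward model onto the teacher judge's six per-dimension scores. Every class name, feature shape, and hyperparameter below is a hypothetical stand-in; the paper's actual architectures and training details are not shown.

```python
import torch
import torch.nn as nn

NUM_DIMS = 6  # e.g. instruction following, manipulation success, physical plausibility, ...

class StudentRewardModel(nn.Module):
    """Hypothetical lightweight reward head over pooled video-instruction features."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, NUM_DIMS),  # one score per evaluation dimension
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats)  # (batch, NUM_DIMS)

def distill_step(student, optimizer, feats, teacher_scores):
    """Regress the student's per-dimension scores onto the teacher judge's."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(student(feats), teacher_scores)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy tensors stand in for real pooled rollout features and teacher scores.
student = StudentRewardModel()
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
feats = torch.randn(8, 512)
teacher_scores = torch.rand(8, NUM_DIMS)  # teacher judge scores, assumed in [0, 1]
print(distill_step(student, opt, feats, teacher_scores))
```

MSE regression is only one plausible distillation objective; a ranking or cross-entropy loss over the judge's scores would fit the same skeleton.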
Key facts
- RoboAlign-R1 combines reward-aligned post-training with stabilized long-horizon inference.
- Existing robot video world models are trained with low-level objectives like reconstruction and perceptual similarity.
- These models suffer from error accumulation in long-horizon autoregressive prediction.
- RobotWorldBench contains 10,000 annotated video-instruction pairs from four robot data sources.
- RoboAlign-Judge is a multimodal teacher judge providing fine-grained six-dimensional evaluation.
- The teacher is distilled into a lightweight student reward model for efficient RL-based post-training.
- The framework targets instruction following, manipulation success, and physical plausibility.
- The approach aims to reduce long-horizon rollout drift (a toy update is sketched after this list).
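The sketch below shows how a reward model like the distilled student could drive a REINFORCE-style post-training update over autoregressive rollouts. The toy world model, the reward scalarization, and the horizon are all illustrative assumptions, not the framework's actual training recipe.

```python
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    """Stand-in autoregressive predictor over 1-D 'frame' features."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Linear(dim, 2 * dim)  # predicts mean and log-std of the next frame

    def sample_next(self, state):
        mean, log_std = self.net(state).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.exp())
        frame = dist.sample()                     # no gradient through the sample itself
        return frame, dist.log_prob(frame).sum()  # log-prob carries the gradient

def post_train_step(world_model, reward_fn, optimizer, context, horizon=8):
    """One REINFORCE-style update: reinforce rollouts the reward model scores highly."""
    optimizer.zero_grad()
    state, frames, log_probs = context, [], []
    for _ in range(horizon):                      # long-horizon autoregressive rollout
        frame, log_p = world_model.sample_next(state)
        frames.append(frame)
        log_probs.append(log_p)
        state = frame
    with torch.no_grad():
        reward = reward_fn(torch.stack(frames))   # scalarized six-dimensional score (assumption)
    loss = -reward * torch.stack(log_probs).sum() # high reward pushes rollout likelihood up
    loss.backward()
    optimizer.step()
    return loss.item()

model = ToyWorldModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
dummy_reward = lambda rollout: rollout.mean().sigmoid()  # placeholder for the student reward model
print(post_train_step(model, dummy_reward, opt, torch.zeros(32)))
```

In the actual framework the reward would come from the distilled student scoring the rollout against the instruction across the six dimensions; practical RL fine-tuning recipes also commonly add variance reduction (a baseline) and regularization toward the pre-trained model, neither of which is shown here.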