RoboAlign-R1: Reward-Aligned Post-Training for Robot Video World Models
RoboAlign-R1 is a framework that addresses misalignment in robot video world models by combining reward-aligned post-training with stabilized long-horizon inference. Existing models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which correlate poorly with the capabilities a robot actually needs: instruction following, manipulation success, and physical plausibility. They also accumulate errors during long-horizon autoregressive prediction. To address both problems, RoboAlign-R1 introduces RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs drawn from four robot datasets, and RoboAlign-Judge, a multimodal teacher judge that provides fine-grained evaluation along six dimensions. The teacher is distilled into a compact student reward model that drives efficient reinforcement-learning-based post-training, reducing long-horizon rollout drift and improving alignment with task-level outcomes.
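To make the distillation step concrete, here is a minimal PyTorch sketch that regresses a lightweight student reward model onto the teacher judge's six per-dimension scores. Every class name, feature shape, and hyperparameter below is a hypothetical stand-in; the paper's actual architectures and training details are not shown.

```python
import torch
import torch.nn as nn

NUM_DIMS = 6  # e.g. instruction following, manipulation success, physical plausibility, ...

class StudentRewardModel(nn.Module):
    """Hypothetical lightweight reward head over pooled video-instruction features."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, NUM_DIMS),  # one score per evaluation dimension
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats)  # (batch, NUM_DIMS)

def distill_step(student, optimizer, feats, teacher_scores):
    """Regress the student's per-dimension scores onto the teacher judge's."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(student(feats), teacher_scores)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy tensors stand in for real pooled rollout features and teacher scores.
student = StudentRewardModel()
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
feats = torch.randn(8, 512)
teacher_scores = torch.rand(8, NUM_DIMS)  # teacher judge scores, assumed in [0, 1]
print(distill_step(student, opt, feats, teacher_scores))
```

MSE regression is only one plausible distillation objective; a ranking or cross-entropy loss over the judge's scores would fit the same skeleton.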
Key facts
- RoboAlign-R1 combines reward-aligned post-training with stabilized long-horizon inference.
- Existing robot video world models are trained with low-level objectives like reconstruction and perceptual similarity.
- These models suffer from error accumulation in long-horizon autoregressive prediction.
- RobotWorldBench contains 10,000 annotated video-instruction pairs from four robot data sources.
- RoboAlign-Judge is a multimodal teacher judge providing fine-grained six-dimensional evaluation.
- The teacher is distilled into a lightweight student reward model for efficient RL-based post-training.
- The framework targets instruction following, manipulation success, and physical plausibility.
- The approach aims to reduce long-horizon rollout drift (a toy update is sketched after this list).
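The sketch below shows how a reward model like the distilled student could drive a REINFORCE-style post-training update over autoregressive rollouts. The toy world model, the reward scalarization, and the horizon are all illustrative assumptions, not the framework's actual training recipe.

```python
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    """Stand-in autoregressive predictor over 1-D 'frame' features."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Linear(dim, 2 * dim)  # predicts mean and log-std of the next frame

    def sample_next(self, state):
        mean, log_std = self.net(state).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.exp())
        frame = dist.sample()                     # no gradient through the sample itself
        return frame, dist.log_prob(frame).sum()  # log-prob carries the gradient

def post_train_step(world_model, reward_fn, optimizer, context, horizon=8):
    """One REINFORCE-style update: reinforce rollouts the reward model scores highly."""
    optimizer.zero_grad()
    state, frames, log_probs = context, [], []
    for _ in range(horizon):                      # long-horizon autoregressive rollout
        frame, log_p = world_model.sample_next(state)
        frames.append(frame)
        log_probs.append(log_p)
        state = frame
    with torch.no_grad():
        reward = reward_fn(torch.stack(frames))   # scalarized six-dimensional score (assumption)
    loss = -reward * torch.stack(log_probs).sum() # high reward pushes rollout likelihood up
    loss.backward()
    optimizer.step()
    return loss.item()

model = ToyWorldModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
dummy_reward = lambda rollout: rollout.mean().sigmoid()  # placeholder for the student reward model
print(post_train_step(model, dummy_reward, opt, torch.zeros(32)))
```

In the actual framework the reward would come from the distilled student scoring the rollout against the instruction across the six dimensions; practical RL fine-tuning recipes also commonly add variance reduction (a baseline) and regularization toward the pre-trained model, neither of which is shown here.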