Latent Video Prediction Models Outperform in World Model Robustness

ai-technology · 2026-05-18

A recent arXiv preprint presents what it describes as the first systematic study of video foundation models as world models, evaluated along five robustness axes. The study compares four capacity-matched frontier models: V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2. The latent-prediction models (the two V-JEPA variants) form a distinct profile across all five axes: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and temporal direction sensitivity. They degrade more gracefully under pixel corruption, preserve class structure under occlusion, and capture fine-grained physical contact cues. The study addresses a gap left by evaluations that report only top-1 accuracy on clean benchmarks.
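To make "degrading gracefully under pixel corruption" concrete, here is a minimal sketch of how such a curve could be measured. It is not the preprint's protocol, which this summary does not describe. Everything in it is an assumption for illustration: `toy_encoder` is a random projection standing in for a frozen video encoder, the clips are synthetic, the corruption is additive Gaussian noise, and the probe is a nearest-centroid classifier.

```python
import numpy as np


def toy_encoder(clips, proj):
    """Hypothetical stand-in for a frozen video encoder: flatten and project."""
    flat = clips.reshape(clips.shape[0], -1)
    return flat @ proj


def make_synthetic_clips(prototypes, n_per_class, rng):
    """Draw clips around per-class prototype clips (synthetic data, not real video)."""
    n_classes = prototypes.shape[0]
    clips = np.concatenate([
        prototypes[c] + 0.1 * rng.normal(size=(n_per_class, *prototypes.shape[1:]))
        for c in range(n_classes)
    ])
    labels = np.repeat(np.arange(n_classes), n_per_class)
    return clips, labels


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Classify each test feature by its closest class centroid."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == test_y).mean())


def corruption_curve(train, test, proj, sigmas, rng):
    """Accuracy on test clips corrupted with Gaussian pixel noise of each severity."""
    (train_clips, train_y), (test_clips, test_y) = train, test
    train_feats = toy_encoder(train_clips, proj)  # probe is fit on clean features
    curve = {}
    for sigma in sigmas:
        noisy = test_clips + rng.normal(0.0, sigma, size=test_clips.shape)
        curve[float(sigma)] = nearest_centroid_accuracy(
            train_feats, train_y, toy_encoder(noisy, proj), test_y
        )
    return curve


def demo():
    rng = np.random.default_rng(0)
    shape = (4, 8, 8)  # frames, height, width (tiny, for illustration only)
    prototypes = rng.normal(size=(3, *shape))
    proj = rng.normal(size=(int(np.prod(shape)), 16)) / np.sqrt(np.prod(shape))
    train = make_synthetic_clips(prototypes, 20, rng)
    test = make_synthetic_clips(prototypes, 20, rng)
    return corruption_curve(train, test, proj, sigmas=(0.0, 0.5, 50.0), rng=rng)


if __name__ == "__main__":
    for sigma, acc in demo().items():
        print(f"sigma={sigma:>5}: accuracy={acc:.2f}")
```

In a comparison like the one the study describes, the same curve would be computed for each encoder, and a model whose accuracy falls more slowly as the noise level rises would count as the more robust one on this axis.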

Key facts

arXiv:2605.15618
Study analyzes V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2
Five robustness axes: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, temporal direction sensitivity
Latent-prediction models form a distinct profile across all axes
They degrade more gracefully under pixel corruption
They preserve class structure under occlusion
They capture fine-grained physical contact cues
First systematic study of video models as world models
