Latent-Prediction Video Models Lead on World-Model Robustness
A recent arXiv preprint presents the first systematic study of video foundation models as world models, evaluated along five axes of robustness. The study compares four frontier models of matched capacity: V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2. The latent-prediction models, namely the two V-JEPA variants, form a distinct profile across all five axes: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and temporal direction sensitivity. They degrade more gracefully under pixel corruption, preserve class structure under occlusion, and capture fine-grained physical contact cues. The study addresses a gap left by evaluations that report only top-1 accuracy on clean benchmarks.
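To make the idea of graceful degradation under pixel corruption concrete, the sketch below tracks how far an encoder's features drift from their clean values as Gaussian pixel noise increases. It is an illustration only, not the preprint's protocol: the `toy_encoder` function, the noise levels, and the cosine-similarity measure are all assumptions introduced here, and a real evaluation would substitute an actual video model and the paper's own corruption suite.

```python
import numpy as np

def add_gaussian_noise(video, sigma, rng):
    """Return a copy of `video` (values in [0, 1]) with Gaussian pixel noise of std `sigma`."""
    noisy = video + rng.normal(0.0, sigma, size=video.shape)
    return np.clip(noisy, 0.0, 1.0)

def toy_encoder(video):
    """Hypothetical stand-in for a video encoder (assumption, not a real model).

    Mean-pools every frame into a 4x4 grid and flattens the result.
    Expects shape (frames, height, width) with height and width divisible by 4.
    """
    t, h, w = video.shape
    pooled = video.reshape(t, 4, h // 4, 4, w // 4).mean(axis=(2, 4))
    return pooled.reshape(-1)

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def corruption_curve(video, encoder, sigmas, seed=0):
    """Similarity between clean and corrupted features at each noise level."""
    rng = np.random.default_rng(seed)
    clean = encoder(video)
    return [cosine(clean, encoder(add_gaussian_noise(video, s, rng))) for s in sigmas]

# Synthetic clip: 8 grayscale frames of 32x32 pixels (placeholder data).
video = np.random.default_rng(0).random((8, 32, 32))
sigmas = [0.0, 0.1, 0.3, 0.6]  # assumed noise levels, chosen for illustration
curve = corruption_curve(video, toy_encoder, sigmas)

for s, c in zip(sigmas, curve):
    print(f"sigma={s:.1f}  cosine_to_clean={c:.4f}")
```

Under this framing, a model that degrades gracefully would show a curve that falls slowly as the noise level rises, while a brittle one would drop sharply. Comparing models then amounts to passing each model's feature extractor in place of `toy_encoder`.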
Key facts
- arXiv:2605.15618
- Study analyzes V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2
- Five robustness axes: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, temporal direction sensitivity
- Latent-prediction models form a distinct profile across all axes
- They degrade more gracefully under pixel corruption
- They preserve class structure under occlusion
- They capture fine-grained physical contact cues
- First systematic study of video models as world models
Entities
Institutions
- arXiv