VLAs' Safety and Robustness in Open-World Environments Questioned
A new paper on arXiv (2604.21192) argues that current evaluation protocols for vision-language-action models (VLAs) in open-world environments, such as the BEHAVIOR1K (B1K) benchmark, overlook safety and exaggerate performance. The authors analyze state-of-the-art models on the B1K Challenge and assess policies for robustness via reproducibility and consistency. They claim that metrics based solely on final object states ignore the events leading to those states, obscuring core challenges for real-world deployment.
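To make the critique concrete, here is a minimal Python sketch contrasting a final-state-only metric with an event-aware one. All names here (`Rollout`, `UNSAFE_EVENTS`, the event labels) are illustrative assumptions, not the paper's definitions or the B1K API.

```python
from dataclasses import dataclass, field

# Hypothetical event labels; a real benchmark would define these differently.
UNSAFE_EVENTS = {"collision_with_human", "object_dropped", "fluid_spilled"}

@dataclass
class Rollout:
    """Illustrative record of one policy rollout (not the B1K schema)."""
    final_state_satisfied: bool  # did the goal predicates hold at the end?
    events: list[str] = field(default_factory=list)  # logged intermediate events

def final_state_score(r: Rollout) -> float:
    """Final-state-only metric of the kind the paper critiques:
    scores 1.0 whenever the goal state holds, however it was reached."""
    return 1.0 if r.final_state_satisfied else 0.0

def event_aware_score(r: Rollout) -> float:
    """Sketch of a stricter metric: success counts only if no unsafe
    intermediate event occurred on the way to the final state."""
    unsafe = any(e in UNSAFE_EVENTS for e in r.events)
    return 1.0 if r.final_state_satisfied and not unsafe else 0.0

# A rollout that drops an object but still reaches the goal state
# passes the final-state metric yet fails the event-aware one.
r = Rollout(final_state_satisfied=True, events=["object_dropped"])
print(final_state_score(r), event_aware_score(r))  # 1.0 0.0
```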
Key facts
- Paper arXiv:2604.21192 critiques VLA evaluation protocols.
- VLAs are used in robotics for long-horizon tasks like household chores.
- BEHAVIOR1K (B1K) benchmark is used for evaluating complex household tasks.
- Current metrics only consider final object states, not intermediate events.
- Authors argue this exaggerates reported performance and ignores safety.
- Analysis focuses on state-of-the-art models on the B1K Challenge.
- Policies are evaluated for robustness via reproducibility and consistency (see the sketch after this list).
- Paper claims current protocols obscure core challenges of real-world deployment.
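One plausible reading of "robustness via reproducibility and consistency" is re-running the same task across random seeds and checking whether the outcomes agree. The Python sketch below assumes a hypothetical `run_episode(seed)` hook and a simple all-repeats-agree notion of consistency; the paper's exact protocol may differ.

```python
import statistics
from typing import Callable

def robustness_report(run_episode: Callable[[int], bool],
                      seeds: range = range(10)) -> dict:
    """Re-run the same task under different seeds and summarize
    (a) mean success and (b) consistency, i.e. whether repeats agree.
    `run_episode` is a hypothetical hook that executes one seeded
    rollout and returns True on success; it stands in for whatever
    evaluation harness a benchmark provides."""
    outcomes = [run_episode(s) for s in seeds]
    return {
        "mean_success": sum(outcomes) / len(outcomes),
        "stdev": statistics.pstdev(map(float, outcomes)),
        # All-repeats-agree is one simple consistency criterion;
        # the paper may use a finer-grained definition.
        "consistent": len(set(outcomes)) == 1,
    }

# Example with a stubbed policy that fails on odd seeds: a 50% success
# rate with high variance would flag the policy as non-robust.
print(robustness_report(lambda seed: seed % 2 == 0))
```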