Pre-VLA: Runtime Verification for Vision-Language-Action Models

other · 2026-05-23

Pre-VLA is a runtime verification architecture that assesses the validity of actions generated by large vision-language-action (VLA) models and generative world models before those actions are physically executed or rolled out in a world model. It targets the uncertainty inherent in learning-based action generation, which can cause physical failures or inaccurate simulated rollouts. Built on a robust multimodal backbone with modality-aware pooling, it adds a lightweight dual-branch head that predicts a safety confidence and a critic-derived advantage score for each candidate action segment. Training uses a multi-task objective that combines focal classification, advantage regression, and soft-target losses to handle class imbalance and unstable boundary decisions, with the aim of improving reliability in long-horizon embodied intelligence tasks.
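To make the multi-task objective concrete, here is a minimal per-sample sketch in plain Python. It is not the authors' implementation: the focal term follows the standard binary focal loss, while the squared-error form of the advantage regression, the cross-entropy form of the soft-target term, and the weights w_focal, w_adv, and w_soft are all assumptions chosen for illustration.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Standard binary focal loss for p = P(valid) and a hard label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)           # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples the model already classifies well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def advantage_loss(a_pred, a_target):
    """Squared error between predicted and critic-derived advantage (assumed form)."""
    return (a_pred - a_target) ** 2


def soft_target_loss(p, q, eps=1e-7):
    """Binary cross-entropy against a soft target q in [0, 1] (assumed form)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))


def multi_task_loss(p, a_pred, y, q, a_target,
                    w_focal=1.0, w_adv=0.5, w_soft=0.5):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_focal * focal_loss(p, y)
            + w_adv * advantage_loss(a_pred, a_target)
            + w_soft * soft_target_loss(p, q))


# One action segment: predicted confidence 0.7, hard label "valid",
# soft label 0.8, predicted advantage 0.2 against a critic target of 0.5.
loss = multi_task_loss(p=0.7, a_pred=0.2, y=1, q=0.8, a_target=0.5)
```

In this sketch the focal term is what addresses class imbalance, since confidently correct samples contribute little, and the soft target gives segments near the valid/invalid boundary a graded label instead of a hard one.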

Key facts

Pre-VLA performs preemptive action validity assessment before execution or imagination.
It uses a multimodal backbone with modality-aware pooling.
A dual-branch head predicts safety confidence and advantage scores.
Training combines focal classification, advantage regression, and soft-target losses.
The multi-task objective addresses class imbalance and unstable boundary decisions.
Targets long-horizon embodied intelligence with VLA and world models.
Aims to prevent physical failures and reduce redundant rendering costs.
Published on arXiv with ID 2605.22446.
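
As an illustration of the preemptive check listed in the key facts, the sketch below gates candidate action segments on the two head outputs before anything is executed or rendered. The summary does not say how Pre-VLA combines the two scores, so the rule used here (threshold on safety confidence, then pick the highest advantage), the ScoredChunk type, and the min_confidence value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredChunk:
    name: str
    safety_confidence: float  # verifier's estimate that the segment is valid, in [0, 1]
    advantage: float          # verifier's predicted critic-derived advantage


def select_chunk(chunks: List[ScoredChunk],
                 min_confidence: float = 0.8) -> Optional[ScoredChunk]:
    """Return the highest-advantage segment that passes the confidence gate.

    Returns None when every candidate is rejected, so the caller can
    re-sample or re-plan instead of executing or rendering a doubtful action.
    """
    accepted = [c for c in chunks if c.safety_confidence >= min_confidence]
    if not accepted:
        return None
    return max(accepted, key=lambda c: c.advantage)


candidates = [
    ScoredChunk("reach", safety_confidence=0.95, advantage=0.1),
    ScoredChunk("grasp", safety_confidence=0.85, advantage=0.4),
    ScoredChunk("push", safety_confidence=0.50, advantage=0.9),  # rejected by the gate
]
chosen = select_chunk(candidates)
```

Rejecting a segment before execution is what would avoid a physical failure, and rejecting it before a world-model rollout is what would save the rendering cost.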
