Pre-VLA: Runtime Verification for Vision-Language-Action Models
Pre-VLA is a runtime verification architecture that assesses the validity of actions produced by large vision-language-action (VLA) models and generative world models before those actions are physically executed or rolled out in a world model. It targets the uncertainty inherent in learning-based action generation, which can lead to physical failures or inaccurate simulations. Built on a multimodal backbone with modality-aware pooling, it adds a lightweight dual-branch head that predicts a safety confidence and a critic-derived advantage score for each candidate action segment. Training uses a multi-task objective that combines Focal classification, advantage regression, and soft-target losses to address class imbalance and unstable boundary decisions, with the aim of improving reliability in long-horizon embodied intelligence tasks.
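The verification path described above (pool the backbone features per modality, score them with two branches, then decide whether to proceed) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: "modality-aware pooling" is read as mean-pooling each modality's tokens separately and concatenating the results, each branch is a single linear layer, and the function names, modality names, and thresholds are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_pool(tokens: Sequence[Vector]) -> Vector:
    """Average a sequence of token embeddings into a single vector."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / n for i in range(dim)]


def modality_aware_pool(
    tokens_by_modality: Dict[str, Sequence[Vector]],
    order: Tuple[str, ...] = ("vision", "language", "action"),
) -> Vector:
    """Assumed reading of modality-aware pooling: pool each modality on its
    own, then concatenate in a fixed order so no modality is averaged away."""
    pooled: Vector = []
    for name in order:
        pooled.extend(mean_pool(tokens_by_modality[name]))
    return pooled


def linear(features: Vector, weights: Vector, bias: float) -> float:
    """Single-output linear layer."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dual_branch_head(
    features: Vector,
    safety_w: Vector, safety_b: float,
    adv_w: Vector, adv_b: float,
) -> Tuple[float, float]:
    """Two branches over shared features: a safety confidence in (0, 1)
    and an unbounded advantage score."""
    safety_conf = sigmoid(linear(features, safety_w, safety_b))
    advantage = linear(features, adv_w, adv_b)
    return safety_conf, advantage


def should_execute(
    safety_conf: float,
    advantage: float,
    conf_threshold: float = 0.5,  # hypothetical threshold
    adv_threshold: float = 0.0,   # hypothetical threshold
) -> bool:
    """Gate an action segment before execution or world-model rollout."""
    return safety_conf >= conf_threshold and advantage >= adv_threshold
```

In use, a caller would run `modality_aware_pool` on the backbone's token embeddings for a candidate action segment, pass the result through `dual_branch_head`, and only execute the segment or roll it out in the world model when `should_execute` returns `True`.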
Key facts
- Pre-VLA performs preemptive action validity assessment before execution or imagination.
- It uses a multimodal backbone with modality-aware pooling.
- A dual-branch head predicts safety confidence and advantage scores.
- Training combines Focal classification, advantage regression, and soft-target losses.
- The multi-task objective addresses class imbalance and unstable boundary decisions.
- It targets long-horizon embodied intelligence tasks that use VLA models and world models.
- It aims to prevent physical failures and reduce redundant rendering costs.
- Published on arXiv with ID 2605.22446.
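The training objective named in the key facts can be sketched for a single example as follows. The focal term follows the standard focal loss formulation; the other choices are assumptions made for illustration and are not confirmed by the source: the advantage regression term is taken to be squared error, the soft-target term is taken to be binary cross-entropy against a soft label in [0, 1], and the weights `w_focal`, `w_adv`, and `w_soft` are hypothetical.

```python
import math

EPS = 1e-7  # keeps log() away from zero


def _clip(p: float) -> float:
    return min(max(p, EPS), 1.0 - EPS)


def focal_loss(p: float, label: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Standard binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).
    Easy, well-classified examples are down-weighted, which helps with
    class imbalance."""
    p = _clip(p)
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def advantage_regression_loss(pred: float, target: float) -> float:
    """Assumed form: squared error against the critic-derived advantage."""
    return (pred - target) ** 2


def soft_target_loss(p: float, soft_label: float) -> float:
    """Assumed form: binary cross-entropy against a soft label in [0, 1],
    which softens decisions for examples near the class boundary."""
    p = _clip(p)
    return -(soft_label * math.log(p) + (1.0 - soft_label) * math.log(1.0 - p))


def multi_task_loss(
    p: float, label: int,
    adv_pred: float, adv_target: float,
    soft_label: float,
    w_focal: float = 1.0, w_adv: float = 1.0, w_soft: float = 1.0,  # hypothetical weights
) -> float:
    """Weighted sum of the three terms for a single training example."""
    return (
        w_focal * focal_loss(p, label)
        + w_adv * advantage_regression_loss(adv_pred, adv_target)
        + w_soft * soft_target_loss(p, soft_label)
    )
```

With `gamma=0` and `alpha=0.5` the focal term reduces to half the ordinary cross-entropy, which makes the effect of the focusing parameter easy to check in isolation.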
Entities
Institutions
- arXiv