VLA Driving Models Show 42.5% Reasoning Fidelity, 94 Missed Pedestrians
A systematic investigation into the reliability of Vision-Language-Action (VLA) driving models has found substantial deficiencies in their reasoning. Researchers evaluated 300 Alpamayo-R1-10B inferences across 100 PhysicalAI-AV scenarios and measured an overall reasoning fidelity of just 42.5%. The model's Chain-of-Causation reasoning aligned with the actual scene less than half the time. The study counted 94 missed pedestrians, occurring in one-third of pedestrian-relevant scenes, and found 97.7% trajectory fragility under mild visual perturbations. Mean consistency between reasoning and action was only 48.3%: 53.3% of inferences showed low consistency, and in 37.9% of the cases where the reasoning claimed a stop, the model continued instead. This is the first systematic study of faithfulness in VLA driving models; it introduces information-theoretic definitions of fidelity and proposes a four-component safety framework.
Key facts
- First systematic study of faithfulness in VLA driving models
- Analyzed 300 Alpamayo-R1-10B inferences across 100 PhysicalAI-AV scenarios
- Overall reasoning fidelity is 42.5%
- 94 missed pedestrians in one-third of pedestrian-relevant scenes
- 97.7% trajectory fragility under mild visual perturbations
- 48.3% mean reasoning-action consistency
- 53.3% of inferences exhibit low consistency
- In 37.9% of stop-claimed cases, the model continues instead of stopping
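
The consistency figures above are aggregate statistics over individual inferences. The sketch below shows one plausible way such aggregates could be computed; it is not the study's method. The function names, the 0.5 cutoff for "low" consistency, the action labels, and the toy values are all assumptions made for illustration, and the paper's information-theoretic fidelity definitions are not reproduced here.

```python
from statistics import mean


def summarize_consistency(scores, low_threshold=0.5):
    """Return (mean score, share of scores strictly below low_threshold).

    `scores` are hypothetical per-inference reasoning-action consistency
    values in [0, 1]; the 0.5 cutoff is an assumption, not the paper's.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    low = sum(1 for s in scores if s < low_threshold)
    return mean(scores), low / len(scores)


def stop_violation_rate(records):
    """Share of stop-claimed inferences whose executed action is not a stop.

    `records` is an iterable of (claimed_action, executed_action) pairs;
    the string labels are hypothetical.
    """
    executed_after_stop_claim = [
        executed for claimed, executed in records if claimed == "stop"
    ]
    if not executed_after_stop_claim:
        return 0.0
    violations = sum(1 for e in executed_after_stop_claim if e != "stop")
    return violations / len(executed_after_stop_claim)


# Toy values for illustration only; these are not data from the study.
toy_scores = [0.9, 0.2, 0.4, 0.7]
toy_records = [("stop", "continue"), ("stop", "stop"), ("continue", "continue")]

avg, low_share = summarize_consistency(toy_scores)
violation_share = stop_violation_rate(toy_records)
```

Note that the stop-violation rate uses only stop-claimed inferences as its denominator, which matches how the 37.9% figure is phrased in the list above. At the study's scale, 53.3% of 300 inferences corresponds to roughly 160 inferences.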
Entities
Institutions
- PhysicalAI
- arXiv