Study Probes Visual Dependency in VLA Autonomous Driving Models
A new preprint posted to arXiv investigates how Vision-Language-Action (VLA) models for autonomous driving rely on visual information. Current evaluation protocols focus on aggregate metrics and lack diagnostics that quantify how strongly driving behavior depends on visual input. To fill that gap, the researchers introduce a multi-level visual perturbation framework that applies controlled perturbations along three dimensions: channel-level degradation, information-level disruption, and structure-level modification. The study then measures behavioral responses of VLA-based driving systems in two settings: open-loop trajectory prediction and closed-loop control.
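To make the three perturbation dimensions concrete, here is a minimal Python sketch. The summary does not say which operations the paper uses, so every function below is a hypothetical stand-in: zeroing a color channel for channel-level degradation, occluding a rectangle for information-level disruption, and shuffling image tiles for structure-level modification. The `trajectory_shift` score (mean waypoint displacement between clean and perturbed predictions) is likewise an assumed way to quantify visual-behavior dependency in the open-loop setting, not the authors' metric.

```python
import math
import random

# Assumption: an image is a list of rows, each row a list of (r, g, b) tuples.

def drop_channel(img, channel):
    """Channel-level (hypothetical): zero one color channel (0=R, 1=G, 2=B)."""
    return [[tuple(0 if i == channel else v for i, v in enumerate(px))
             for px in row] for row in img]

def mask_region(img, top, left, height, width, fill=(0, 0, 0)):
    """Information-level (hypothetical): occlude a rectangle with a fill color."""
    out = [list(row) for row in img]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out

def shuffle_patches(img, patch, seed=0):
    """Structure-level (hypothetical): permute non-overlapping square tiles."""
    h, w = len(img), len(img[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    origins = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    sources = origins[:]
    random.Random(seed).shuffle(sources)  # seeded so the perturbation is repeatable
    out = [list(row) for row in img]
    for (dr, dc), (sr, sc) in zip(origins, sources):
        for i in range(patch):
            for j in range(patch):
                out[dr + i][dc + j] = img[sr + i][sc + j]
    return out

def trajectory_shift(clean, perturbed):
    """Mean Euclidean distance between matched (x, y) waypoints."""
    if len(clean) != len(perturbed) or not clean:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(a, b) for a, b in zip(clean, perturbed)) / len(clean)
```

In use, one would run the same driving model on a clean frame and on each perturbed copy, then compare the predicted trajectories with `trajectory_shift`. Under this assumed score, a value near zero for a given perturbation would suggest the model's output barely depends on the visual information that perturbation removed.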
Key facts
- arXiv paper 2605.31041 introduces a visual perturbation framework for VLA driving models.
- Framework has three perturbation dimensions: channel-level, information-level, structure-level.
- Study evaluates both open-loop trajectory prediction and closed-loop control.
- Current evaluation protocols lack structured diagnostics for visual-behavior dependency.
- VLA models show promise in autonomous driving, but their visual grounding is poorly understood.
- Research aims to systematically analyze how VLA driving behavior depends on visual input.
Entities
Institutions
- arXiv