Study Probes Visual Dependency in VLA Autonomous Driving Models

other · 2026-06-01

A new study posted to arXiv investigates how Vision-Language-Action (VLA) models for autonomous driving rely on visual information. The researchers introduce a multi-level visual perturbation framework to systematically analyze visual-behavior dependency, that is, how strongly a model's driving behavior depends on its visual input. The framework applies controlled perturbations along three dimensions: channel-level degradation, information-level disruption, and structure-level modification. The study evaluates the behavioral responses of VLA-based driving systems under both open-loop trajectory prediction and closed-loop control. Current evaluation protocols focus on aggregate metrics and lack diagnostics that quantify visual-behavior dependency; this work aims to fill that gap with structured diagnostics.
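The summary does not describe the paper's actual perturbation operators, so the following is only a hypothetical sketch of what one operator per dimension might look like on a camera frame, plus an ADE-style score for how far a predicted trajectory moves when the input is perturbed. The function names, the pixel-grid image format, and the specific choices (zeroing a colour channel, blanking patches, shuffling patches) are illustrative assumptions, not the study's method.

```python
import math
import random

# Hypothetical image format (assumption): an H x W grid of (r, g, b) floats in [0, 1].
Pixel = tuple[float, float, float]
Image = list[list[Pixel]]


def channel_degrade(img: Image, channel: int) -> Image:
    """Channel-level (illustrative): zero one colour channel in every pixel."""
    return [
        [tuple(0.0 if c == channel else v for c, v in enumerate(px)) for px in row]
        for row in img
    ]


def _patch_corners(h: int, w: int, p: int) -> list[tuple[int, int]]:
    """Top-left corners of non-overlapping p x p patches; assumes h and w divide by p."""
    return [(i, j) for i in range(0, h, p) for j in range(0, w, p)]


def information_mask(img: Image, p: int, frac: float, seed: int = 0) -> Image:
    """Information-level (illustrative): black out a random fraction of p x p patches."""
    corners = _patch_corners(len(img), len(img[0]), p)
    masked = random.Random(seed).sample(corners, round(frac * len(corners)))
    out = [list(row) for row in img]
    for i, j in masked:
        for y in range(i, i + p):
            for x in range(j, j + p):
                out[y][x] = (0.0, 0.0, 0.0)
    return out


def structure_shuffle(img: Image, p: int, seed: int = 0) -> Image:
    """Structure-level (illustrative): permute p x p patches; pixel values are kept."""
    corners = _patch_corners(len(img), len(img[0]), p)
    targets = corners[:]
    random.Random(seed).shuffle(targets)
    out = [list(row) for row in img]
    # Each source patch is copied to exactly one target slot (a permutation).
    for (si, sj), (ti, tj) in zip(corners, targets):
        for dy in range(p):
            for dx in range(p):
                out[ti + dy][tj + dx] = img[si + dy][sj + dx]
    return out


def trajectory_deviation(ref: list[tuple[float, float]],
                         pert: list[tuple[float, float]]) -> float:
    """Mean L2 distance between matching (x, y) waypoints (an ADE-style score)."""
    if len(ref) != len(pert) or not ref:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(a, b) for a, b in zip(ref, pert)) / len(ref)


if __name__ == "__main__":
    frame = [[(y / 4, x / 4, 0.5) for x in range(4)] for y in range(4)]
    shuffled = structure_shuffle(frame, p=2, seed=1)
    # Same multiset of pixels, different spatial layout.
    print(sorted(sum(frame, [])) == sorted(sum(shuffled, [])))
    print(trajectory_deviation([(0, 0), (1, 0)], [(0, 3), (1, 4)]))
```

In a real evaluation the two trajectories would come from the same driving model run on a clean and a perturbed frame, and a larger deviation would suggest stronger dependence on the perturbed visual property. That wiring is omitted here because the summary does not specify the models or metrics the study uses.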

Key facts

arXiv paper 2605.31041 introduces a visual perturbation framework for VLA driving models.
Framework has three perturbation dimensions: channel-level, information-level, and structure-level.
Study evaluates both open-loop trajectory prediction and closed-loop control.
Current evaluation protocols lack structured diagnostics for visual-behavior dependency.
VLA models show promise in autonomous driving, but their visual grounding is poorly understood.
Research aims to systematically analyze how VLA driving behavior depends on visual input.