VLADriver-RAG: Retrieval-Augmented VLA Models for Autonomous Driving
VLADriver-RAG is a retrieval-augmented framework that addresses a key weakness of Vision-Language-Action (VLA) models in autonomous driving: poor generalization in long-tail scenarios. Conventional visual retrieval for such models suffers from high latency and semantic confusion. To mitigate this, VLADriver-RAG introduces a Visual-to-Scenario mechanism that converts raw sensor data into spatiotemporal semantic graphs, filtering out visual clutter. A Scenario-Aligned Embedding Model then applies Graph-DTW metric alignment, prioritizing topological consistency over surface-level visual similarity. The retrieved scenario priors are fused into the model to improve planning. The paper is available on arXiv under ID 2605.08133.
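The paper's exact Graph-DTW formulation is not detailed here, but the general idea of dynamic-time-warping alignment between two scenario traces can be sketched with classic DTW, treating each time step as a graph-embedding vector. All names below (`dtw_distance`, the toy embeddings) are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch: DTW alignment cost between two scenario sequences,
# where each time step is a (toy, 2-D) graph-embedding vector.
# This is plain DTW; the paper's Graph-DTW metric may differ.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Classic dynamic-time-warping cost between two embedding sequences."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a step in seq_b
                                 cost[i][j - 1],      # skip a step in seq_a
                                 cost[i - 1][j - 1])  # match both steps
    return cost[n][m]

# Two toy scenario traces: the archived one is a slightly perturbed
# version of the query, so their alignment cost is small.
query   = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
archive = [(0.0, 0.0), (0.9, 0.1), (2.1, -0.1)]
print(round(dtw_distance(query, archive), 3))  # → 0.283
```

Because DTW tolerates local time shifts, two scenarios that unfold at different speeds can still align with low cost, which is consistent with the paper's emphasis on topological consistency over raw visual similarity.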
Key facts
- VLADriver-RAG is a framework for autonomous driving.
- It enhances Vision-Language-Action (VLA) models.
- Addresses poor generalization in long-tail driving scenarios.
- Uses Visual-to-Scenario mechanism to create spatiotemporal semantic graphs.
- Employs Scenario-Aligned Embedding Model with Graph-DTW metric alignment.
- Prioritizes topological consistency over visual similarity.
- Retrieved priors are fused within the model.
- Paper available on arXiv with ID 2605.08133.
Entities
Institutions
- arXiv