LVDrive: Latent Visual Representation for Autonomous Driving
LVDrive is a framework that enhances Vision-Language-Action (VLA) models for autonomous driving by adding a future scene prediction task carried out in a high-level latent space rather than through pixel-level reconstruction. A pretrained vision backbone provides auxiliary supervision for the predicted future representations, and future scene and motion prediction are modeled jointly in a unified embedding space and produced in a single forward pass. The design targets two weaknesses of prior work: existing VLAs rely on sparse action supervision, and world-modeling approaches that add dense visual supervision overemphasize pixel-level image reconstruction. The paper is available on arXiv with ID 2605.22089.
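To make the idea of latent-space auxiliary supervision concrete, the following is a minimal sketch of how a joint objective could combine a motion term with a latent future-scene term. It is not the paper's loss: the cosine-distance and L1 choices, the `latent_weight` parameter, and all function names are assumptions made for illustration. The target latent stands in for features that a frozen pretrained vision backbone would extract from the real future frame.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def joint_loss(pred_latent, target_latent, pred_waypoints, gt_waypoints,
               latent_weight=1.0):
    """Hypothetical joint objective (an assumption, not the paper's formula).

    pred_latent:    future-scene embedding predicted by the driving model
    target_latent:  embedding of the real future frame from a frozen,
                    pretrained vision backbone (the auxiliary supervision)
    pred_waypoints: predicted trajectory, flattened to [x0, y0, x1, y1, ...]
    gt_waypoints:   ground-truth trajectory in the same layout
    """
    latent_term = cosine_distance(pred_latent, target_latent)  # dense signal
    action_term = l1_loss(pred_waypoints, gt_waypoints)        # sparse signal
    return action_term + latent_weight * latent_term
```

Because the latent term compares embeddings rather than pixels, the supervision is dense without requiring the model to reconstruct image detail. How the two terms are actually weighted and defined in LVDrive is not stated in this summary.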
Key facts
- LVDrive is a VLA framework for autonomous driving enhanced with latent visual representations.
- It introduces a future scene prediction task into the VLA paradigm.
- Future representations are learned in a high-level latent space under auxiliary supervision from a pretrained vision backbone.
- The framework jointly models future scene and motion prediction within a unified embedding space.
- Both predictions are produced in a single forward pass, avoiding inefficient autoregressive generation.
- Existing VLAs rely on sparse action supervision, which underutilizes their scene understanding capabilities.
- Previous attempts to add dense visual supervision through world modeling overemphasize pixel-level reconstruction.
- The paper is published on arXiv with ID 2605.22089.
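The efficiency argument behind the single forward pass can be illustrated with a toy comparison of how many model calls each decoding style needs. This is a sketch under stated assumptions: `CountingModel` is a hypothetical stand-in that only counts calls, and the use of placeholder query tokens for parallel decoding is an assumption, not a description of LVDrive's architecture.

```python
class CountingModel:
    """Hypothetical stand-in for a VLA backbone; it only counts forward calls."""

    def __init__(self):
        self.calls = 0

    def forward(self, tokens):
        self.calls += 1
        # Toy computation: one output value per input token.
        return [t + 1 for t in tokens]


def decode_autoregressive(model, context, num_outputs):
    """Generate outputs one at a time, feeding each back as input."""
    tokens = list(context)
    outputs = []
    for _ in range(num_outputs):
        next_token = model.forward(tokens)[-1]
        outputs.append(next_token)
        tokens.append(next_token)
    return outputs


def decode_single_pass(model, context, num_outputs):
    """Append placeholder queries and read all outputs from one call."""
    queries = [0] * num_outputs  # assumed placeholder query tokens
    out = model.forward(list(context) + queries)
    return out[-num_outputs:]


if __name__ == "__main__":
    ar_model, sp_model = CountingModel(), CountingModel()
    decode_autoregressive(ar_model, [1, 2, 3], num_outputs=8)
    decode_single_pass(sp_model, [1, 2, 3], num_outputs=8)
    print(ar_model.calls, sp_model.calls)
```

With eight outputs, the autoregressive loop invokes the model eight times in sequence, while the single-pass variant invokes it once. Real latency depends on sequence length and hardware, so the call count is only a rough proxy.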
Entities
Institutions
- arXiv