LVDrive: Latent Visual Representation for Autonomous Driving

ai-technology · 2026-05-23

A new framework called LVDrive enhances Vision-Language-Action (VLA) models for autonomous driving by adding a future scene prediction task that operates in a high-level latent space rather than reconstructing pixels. The approach uses a pretrained vision backbone for auxiliary supervision and jointly models future scene and motion prediction in a unified embedding space, producing both in a single forward pass. It targets two weaknesses of prior work: existing VLAs rely on sparse action supervision, and world-modeling approaches overemphasize pixel-level image reconstruction. The paper is available on arXiv with ID 2605.22089.

Key facts

LVDrive is a VLA framework for autonomous driving enhanced with latent visual representations.
It introduces a future scene prediction task into the VLA paradigm.
Future representations are learned in a high-level latent space under auxiliary supervision from a pretrained vision backbone.
The framework jointly models future scene and motion prediction within a unified embedding space.
Processing is done in a single forward pass, departing from inefficient autoregressive generation.
Existing VLAs rely on sparse action supervision, which underutilizes their scene understanding capabilities.
Previous attempts with dense visual supervision via world modeling overemphasize pixel-level reconstruction.
The paper is published on arXiv with ID 2605.22089.
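
To make the idea of joint latent supervision concrete, here is a minimal sketch of how a future-scene term and a motion term might be combined into one training objective. It is not the paper's implementation: the cosine-distance scene loss, the L1 waypoint loss, the `scene_weight` parameter, the function names, and all tensor shapes are assumptions chosen for illustration. The only points taken from the summary are that the scene target comes from a pretrained vision backbone's features of the future frame (not pixels) and that both predictions come out of the same forward pass.

```python
import numpy as np

def cosine_distance(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) over tokens; inputs are (tokens, dim)."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))

def joint_loss(pred_scene_latent, target_scene_latent,
               pred_waypoints, gt_waypoints, scene_weight=1.0):
    """Combine a latent future-scene loss with a motion loss (hypothetical form).

    pred_scene_latent:   (tokens, dim) future-scene embedding predicted by the VLA
    target_scene_latent: (tokens, dim) features of the real future frame from a
                         frozen pretrained vision backbone (assumed precomputed)
    pred_waypoints, gt_waypoints: (steps, 2) planned vs. ground-truth positions
    """
    scene_loss = cosine_distance(pred_scene_latent, target_scene_latent)  # dense, latent-space
    motion_loss = float(np.mean(np.abs(pred_waypoints - gt_waypoints)))   # sparse action signal
    total = motion_loss + scene_weight * scene_loss
    return total, scene_loss, motion_loss

# Usage with random stand-in data (no real model or backbone involved).
rng = np.random.default_rng(0)
target_latent = rng.normal(size=(16, 64))
pred_latent = target_latent + 0.1 * rng.normal(size=(16, 64))
gt_wp = rng.normal(size=(6, 2))
pred_wp = gt_wp + 0.05 * rng.normal(size=(6, 2))
total, scene, motion = joint_loss(pred_latent, target_latent, pred_wp, gt_wp)
print(f"total={total:.4f} scene={scene:.4f} motion={motion:.4f}")
```

In this sketch the scene term supplies a training signal at every latent token, while the motion term depends only on a handful of waypoints, which is the contrast between dense and sparse supervision that the summary describes.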
