ARTFEED — Contemporary Art Intelligence

LVDrive: Latent Visual Representation for Autonomous Driving

ai-technology · 2026-05-23

LVDrive is a new framework that enhances Vision-Language-Action (VLA) models for autonomous driving by adding a future scene prediction task in a high-level latent space rather than reconstructing pixels. A pretrained vision backbone provides auxiliary supervision for the predicted future representations. Future scene and motion prediction are modeled jointly in a unified embedding space and computed in a single forward pass. The design targets two weaknesses of prior work: existing VLAs rely on sparse action supervision, and world-modeling approaches that add dense visual supervision overemphasize pixel-level image reconstruction. The paper is available on arXiv with ID 2605.22089.
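
To make the idea of latent-space supervision concrete, the following is a minimal sketch of a combined training objective, not the paper's actual loss. It assumes a cosine distance between the predicted future latent and a target latent produced by a frozen pretrained backbone, plus an L1 term on trajectory waypoints. The function names, the choice of distances, and the `latent_weight` parameter are all hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two latent vectors (assumed latent loss)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def l1_trajectory_error(pred, target):
    """Mean L1 error over (x, y) waypoints (assumed motion loss)."""
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / len(pred)

def joint_loss(pred_latent, target_latent, pred_traj, target_traj,
               latent_weight=1.0):
    """Sparse action supervision plus dense latent supervision.

    target_latent stands in for the output of a frozen pretrained vision
    backbone applied to the real future frame; no pixels are reconstructed.
    """
    motion_term = l1_trajectory_error(pred_traj, target_traj)
    latent_term = cosine_distance(pred_latent, target_latent)
    return motion_term + latent_weight * latent_term
```

In this sketch the latent term gives a training signal at every future frame even when the trajectory labels are sparse, which is the motivation the summary describes.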

Key facts

  • LVDrive is a VLA framework for autonomous driving enhanced with Latent Visual representations.
  • It introduces a future scene prediction task into the VLA paradigm.
  • Future representations are learned in a high-level latent space under auxiliary supervision from a pretrained vision backbone.
  • The framework jointly models future scene and motion prediction within a unified embedding space.
  • Processing is done in a single forward pass, departing from inefficient autoregressive generation.
  • Existing VLAs rely on sparse action supervision, underutilizing scene understanding capabilities.
  • Previous attempts to add dense visual supervision through world modeling overemphasize pixel-level reconstruction.
  • The paper is published on arXiv with ID 2605.22089.
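
The contrast between single-pass and autoregressive decoding in the list above can be illustrated with a toy sketch. The `CountingModel` class and both decode functions are hypothetical stand-ins, not LVDrive's architecture: the "model" is a placeholder that only counts how many times it is invoked, and the query slots are assumed to be extra inputs appended to the context.

```python
class CountingModel:
    """Toy stand-in for a transformer: returns one output per input position."""

    def __init__(self):
        self.calls = 0

    def forward(self, tokens):
        self.calls += 1
        mean = sum(tokens) / len(tokens)
        return [mean + t for t in tokens]

def decode_autoregressive(model, context, steps):
    """Generate outputs one at a time, re-running the model at each step."""
    seq = list(context)
    outputs = []
    for _ in range(steps):
        nxt = model.forward(seq)[-1]
        outputs.append(nxt)
        seq.append(nxt)
    return outputs

def decode_single_pass(model, context, scene_queries, motion_queries):
    """Append scene and motion query slots and read both out in one call."""
    out = model.forward(list(context) + list(scene_queries) + list(motion_queries))
    start = len(context)
    split = start + len(scene_queries)
    return out[start:split], out[split:]

ar_model = CountingModel()
decode_autoregressive(ar_model, [1.0, 2.0], steps=4)

sp_model = CountingModel()
scene, motion = decode_single_pass(sp_model, [1.0, 2.0], [0.0, 0.0], [0.0, 0.0])

print(ar_model.calls, sp_model.calls)
```

Here four autoregressive outputs cost four model calls, while the same number of scene and motion outputs come from one call because both sets of query slots share the same sequence.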

Entities

Institutions

  • arXiv
