VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

other · 2026-05-20

VL-DPO is a framework that uses vision-language models (VLMs) to align ego-vehicle motion forecasting for autonomous driving with human preferences. A VLM acts as a zero-shot reasoner that generates preference pairs from a pretrained model's rollouts, and the model is then finetuned with Direct Preference Optimization (DPO). Models are finetuned on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluated against held-out human preference annotations. The work targets the limitations of standard imitation learning in capturing nuanced driving preferences.
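
The pair-generation step can be illustrated with a minimal sketch. The summary does not say how the VLM is prompted or whether it scores rollouts individually or compares them pairwise, so everything here is an assumption: the `Trajectory` representation, the `build_preference_pairs` helper, the `min_gap` threshold, and `toy_judge`, a hand-written stand-in for a real VLM call.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: a rollout is a list of (x, y) waypoints.
Trajectory = List[Tuple[float, float]]


def build_preference_pairs(
    rollouts: Sequence[Trajectory],
    judge: Callable[[Trajectory], float],
    min_gap: float = 0.0,
) -> List[Tuple[Trajectory, Trajectory]]:
    """Return (chosen, rejected) pairs ranked by the judge's scores.

    `judge` stands in for the VLM reasoner; pairs whose score difference
    is at most `min_gap` are skipped as too ambiguous to label.
    """
    scored = [(judge(t), t) for t in rollouts]
    pairs = []
    for (score_a, traj_a), (score_b, traj_b) in combinations(scored, 2):
        if abs(score_a - score_b) <= min_gap:
            continue
        if score_a > score_b:
            pairs.append((traj_a, traj_b))
        else:
            pairs.append((traj_b, traj_a))
    return pairs


def toy_judge(traj: Trajectory) -> float:
    # Placeholder for a VLM query: prefers less total lateral deviation.
    return -sum(abs(y) for _, y in traj)


rollouts = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, 0.5)],
    [(0.0, 0.0), (1.0, 2.0)],
]
pairs = build_preference_pairs(rollouts, toy_judge)
```

In a real pipeline the judge would query a VLM with the scene's camera images and the candidate trajectories. The toy scorer only shows how rollouts from one scene turn into (chosen, rejected) pairs.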

Key facts

VL-DPO is a vision-language-guided framework for aligning ego-vehicle motion forecasting models with human preferences.
It uses a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts.
Finetuning is performed via Direct Preference Optimization (DPO).
Models are finetuned on the Waymo Open End-to-End Driving Dataset (WOD-E2E).
Performance is evaluated against held-out human preference annotations.
The approach aims to capture complex nuances of human driving preferences beyond standard imitation objectives.
The paper is published on arXiv with ID 2605.20082.
The work builds on recent advances in vision-language models (VLMs) for reasoning and commonsense understanding.
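
For reference, the standard DPO objective for a single preference pair can be sketched as below. This is the generic DPO loss, not necessarily the exact variant used in VL-DPO. The function name, the argument names, and the default `beta=0.1` are assumptions for illustration, and the inputs are assumed to be log-probabilities of the chosen and rejected trajectories under the finetuned policy and under the frozen pretrained reference model.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    z = beta * (policy_margin - ref_margin)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy matches the reference, the loss is log(2).
baseline = dpo_loss(-5.0, -6.0, -5.0, -6.0)

# When the policy favors the chosen trajectory more than the reference does,
# the loss falls below that baseline.
improved = dpo_loss(-4.0, -7.0, -5.0, -6.0)
```

The loss decreases as the policy widens the log-probability gap between chosen and rejected trajectories relative to the reference, and `beta` controls how strongly the policy is held close to the pretrained model.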
