VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving
VL-DPO is a framework that uses a vision-language model (VLM) to align ego-vehicle motion forecasting for autonomous driving with human preferences. A VLM acting as a zero-shot reasoner generates preference pairs from a pretrained model's rollouts, and the model is then finetuned with Direct Preference Optimization (DPO). Models are finetuned on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluated against held-out human preference annotations. The work targets a limitation of standard imitation learning: its objective does not capture the nuances of human driving preferences.
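The preference-pair step can be pictured with a small sketch. It is illustrative only: the summary does not say how the VLM compares rollouts, so the code assumes the VLM's judgment can be reduced to one numeric score per rollout. The names `PreferencePair`, `build_preference_pair`, `score_fn`, and `min_margin` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# A trajectory is assumed here to be a list of (x, y) waypoints.
Trajectory = List[Tuple[float, float]]


@dataclass
class PreferencePair:
    """One (chosen, rejected) pair for a single driving scene (hypothetical type)."""
    scene_id: str
    chosen: Trajectory
    rejected: Trajectory


def build_preference_pair(
    scene_id: str,
    rollouts: Sequence[Trajectory],
    score_fn: Callable[[Trajectory], float],
    min_margin: float = 0.0,
) -> Optional[PreferencePair]:
    """Pick the best- and worst-scored rollouts as a preference pair.

    `score_fn` is a stand-in for the VLM zero-shot reasoner (an assumption:
    the real system may compare rollouts pairwise or rank them instead).
    Returns None when there are fewer than two rollouts or when the score
    gap is not larger than `min_margin`, so ambiguous scenes are skipped.
    """
    if len(rollouts) < 2:
        return None
    scores = [score_fn(r) for r in rollouts]
    best = max(range(len(rollouts)), key=scores.__getitem__)
    worst = min(range(len(rollouts)), key=scores.__getitem__)
    if scores[best] - scores[worst] <= min_margin:
        return None
    return PreferencePair(scene_id, list(rollouts[best]), list(rollouts[worst]))


if __name__ == "__main__":
    # Toy stand-in scorer: prefer trajectories that end near x = 2.0.
    def toy_score(traj: Trajectory) -> float:
        return -abs(traj[-1][0] - 2.0)

    candidates = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (2.0, 0.0)]]
    print(build_preference_pair("scene-0", candidates, toy_score))
```

Skipping scenes whose score gap is too small is a design choice of this sketch, not a documented part of VL-DPO; it keeps near-ties from producing noisy training pairs.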
Key facts
- VL-DPO is a vision-language-guided framework for aligning ego-vehicle motion forecasting models with human preferences.
- It uses a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts.
- Finetuning is performed via Direct Preference Optimization (DPO).
- Models are finetuned on the Waymo Open End-to-End Driving Dataset (WOD-E2E).
- Performance is evaluated against held-out human preference annotations.
- The approach aims to capture complex nuances of human driving preferences beyond standard imitation objectives.
- The paper is posted on arXiv as arXiv:2605.20082.
- The work builds on recent advances in vision-language models (VLMs) for reasoning and commonsense understanding.
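To make the DPO finetuning step in the key facts concrete, here is a minimal sketch of the standard per-pair DPO loss. Whether VL-DPO uses this loss unmodified is not stated in this summary. The function assumes that each trajectory's log-probability is available under both the model being finetuned and a frozen reference copy of the pretrained model. The function name, the argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; beta controls deviation from the reference
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy raises the chosen trajectory relative to the reference: lower loss.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

When the finetuned model matches the reference, the loss is log 2 (about 0.693). It drops as the model assigns relatively more probability to the chosen trajectory than to the rejected one.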
Entities
Institutions
- arXiv
- Waymo