OmniDrive-R1: Reinforcement-Driven Visual Grounding for Autonomous Driving
Researchers have introduced OmniDrive-R1, a Vision-Language Model (VLM) framework for autonomous driving that addresses object hallucination through reinforcement-driven visual grounding. The framework employs an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism that unifies perception and reasoning end to end. Unlike previous approaches, which decouple perception from reasoning and depend on expensive dense localization labels, OmniDrive-R1 lets the model autonomously direct attention to critical image regions for fine-grained analysis. The goal is greater reliability in safety-critical driving scenarios. The work is detailed in a paper on arXiv (ID: 2512.14044).
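As a rough illustration of how such an interleaved perception-and-reasoning loop might work, the sketch below alternates between text reasoning steps and model-requested region crops that are fed back for fine-grained inspection. All names here (`Step`, `run_imcot`, `vlm.step`) are hypothetical and not taken from the paper; this is a minimal sketch of the general pattern, not OmniDrive-R1's implementation.

```python
# Minimal sketch of an interleaved multi-modal chain-of-thought (iMCoT) loop.
# Hypothetical API: the VLM emits either a plain reasoning step or a step that
# requests a region of interest, which is cropped and fed back before continuing.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Step:
    text: str
    roi: Optional[Tuple[int, int, int, int]] = None  # (x1, y1, x2, y2) in pixels

def run_imcot(vlm, image, question: str, max_steps: int = 6) -> str:
    """Alternate between reasoning and grounded 'look closer' actions.

    `vlm` is a hypothetical wrapper with vlm.step(image, crops, question, history) -> Step.
    A Step whose `roi` is set means "zoom into this region before the next step".
    """
    history: List[Step] = []
    crops = []  # fine-grained views gathered so far
    for _ in range(max_steps):
        step = vlm.step(image, crops, question, history)
        history.append(step)
        if step.roi is not None:
            x1, y1, x2, y2 = step.roi
            crops.append(image[y1:y2, x1:x2])  # assumes an HxWxC array; crop for re-inspection
        else:
            return step.text  # a step without an ROI is treated as the final answer
    return history[-1].text
```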
Key facts
- OmniDrive-R1 is a VLM framework for autonomous driving.
- It uses reinforcement-driven visual grounding to reduce object hallucination (see the reward sketch after this list).
- The framework employs an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism.
- It unifies perception and reasoning in an end-to-end manner.
- Previous approaches have decoupled perception and reasoning stages.
- Previous approaches rely on expensive dense localization labels.
- The model can autonomously direct attention to critical regions.
- The paper is available on arXiv with ID 2512.14044.
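To illustrate how a reinforcement signal could discourage hallucinated objects, the sketch below scores the regions a reasoning trace refers to by IoU against reference boxes, so predictions with no visual support pull the reward toward zero. This is a generic, assumed grounding reward shown only as an example; it does not reproduce the paper's actual reward design.

```python
# Hedged sketch of a grounding-based reward for RL fine-tuning: predicted boxes
# cited in an answer are matched to reference boxes by IoU, and unmatched
# (potentially hallucinated) boxes lower the reward.

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_boxes: List[Box], ref_boxes: List[Box], thresh: float = 0.5) -> float:
    """Fraction of predicted regions matching some reference region above `thresh`."""
    if not pred_boxes:
        return 0.0
    matched = sum(1 for p in pred_boxes if any(iou(p, r) >= thresh for r in ref_boxes))
    return matched / len(pred_boxes)

# Example: one grounded prediction, one unsupported prediction -> reward 0.5
print(grounding_reward([(10, 10, 50, 50), (200, 200, 240, 240)], [(12, 8, 52, 48)]))
```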
Entities
Institutions
- arXiv