Text-Guided Dual-Gaze Prediction for Object-Level Driver Attention
Researchers have introduced a new framework for predicting object-level driver attention in autonomous driving. To support it, they construct the G-W3DA dataset by combining a multimodal large language model with the Segment Anything Model 3 (SAM3), decomposing scene-level gaze heatmaps into object-level masks under rigorous cross-validation. This addresses a limitation of existing datasets, which typically provide only global, scene-level gaze and can therefore induce text-vision decoupling and visual bias in Vision-Language Models (VLMs). The study covers the full pipeline, from data construction to model design.
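The decoupling step can be pictured with a short sketch: given a scene-level gaze heatmap and a set of segmentation masks (for example, SAM-style masks prompted on text from a multimodal LLM), each object's attention is the share of total gaze mass falling inside its mask. The function name, threshold, and normalization below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def decouple_heatmap(heatmap: np.ndarray, masks: list, min_score: float = 0.05):
    """Split a scene-level gaze heatmap into object-level attention scores.

    heatmap: (H, W) float array, e.g. a recorded or predicted gaze density map.
    masks:   list of (H, W) boolean arrays, e.g. from a SAM-style segmentation
             model prompted on text from a multimodal LLM (an assumption here,
             not the paper's exact pipeline).
    Returns (mask, score) pairs for objects whose share of the total gaze
    mass exceeds `min_score`, sorted with the most-attended objects first.
    """
    total = heatmap.sum() + 1e-8
    object_attention = []
    for mask in masks:
        score = float(heatmap[mask].sum() / total)  # fraction of gaze mass on this object
        if score >= min_score:
            object_attention.append((mask, score))
    object_attention.sort(key=lambda pair: pair[1], reverse=True)
    return object_attention

if __name__ == "__main__":
    # Toy example: one gaze blob overlapping the first of two masks.
    H, W = 64, 64
    yy, xx = np.mgrid[0:H, 0:W]
    heatmap = np.exp(-((yy - 20) ** 2 + (xx - 20) ** 2) / (2 * 5.0 ** 2))
    mask_a = (yy < 32) & (xx < 32)    # covers the blob
    mask_b = (yy >= 32) & (xx >= 32)  # misses the blob
    for i, (mask, score) in enumerate(decouple_heatmap(heatmap, [mask_a, mask_b])):
        print(f"object {i}: gaze share = {score:.2f}")
```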
Key facts
- arXiv:2604.20191v1
- Published on arXiv
- Proposes dual-branch gaze prediction framework (see the sketch after this list)
- Constructs G-W3DA dataset
- Uses multimodal large language model and SAM3
- Decouples heatmaps into object-level masks
- Addresses scene-level gaze limitations
- Targets human-like autonomous driving
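
As a minimal sketch of what such a dual-branch model could look like, the code below pairs a scene-level branch that decodes a global gaze heatmap with an object-level branch that scores candidate objects conditioned on text features. The module names, dimensions, and fusion scheme are assumptions for illustration only, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class DualGazePredictor(nn.Module):
    """Illustrative dual-branch gaze model (assumed structure, not the paper's).

    - Scene branch: decodes pooled image features into a global gaze heatmap.
    - Object branch: fuses per-object features with a text embedding (e.g. from
      a multimodal LLM) and scores each object's gaze likelihood.
    """

    def __init__(self, img_dim: int = 256, txt_dim: int = 256, heat_size: int = 64):
        super().__init__()
        self.heat_size = heat_size
        # Scene-level branch: global heatmap from pooled image features.
        self.scene_head = nn.Sequential(
            nn.Linear(img_dim, 512), nn.ReLU(),
            nn.Linear(512, heat_size * heat_size),
        )
        # Object-level branch: text-conditioned scoring of per-object features.
        self.fuse = nn.Linear(img_dim + txt_dim, 256)
        self.object_head = nn.Sequential(nn.ReLU(), nn.Linear(256, 1))

    def forward(self, img_feat, obj_feats, txt_feat):
        # img_feat:  (B, img_dim)     pooled image features
        # obj_feats: (B, N, img_dim)  per-object features (e.g. mask-pooled)
        # txt_feat:  (B, txt_dim)     text/instruction embedding
        B, N, _ = obj_feats.shape
        heatmap = self.scene_head(img_feat).view(B, self.heat_size, self.heat_size)
        txt = txt_feat.unsqueeze(1).expand(B, N, -1)      # broadcast text to objects
        fused = self.fuse(torch.cat([obj_feats, txt], dim=-1))
        obj_logits = self.object_head(fused).squeeze(-1)  # (B, N) gaze score per object
        return heatmap, obj_logits

if __name__ == "__main__":
    model = DualGazePredictor()
    heat, scores = model(torch.randn(2, 256), torch.randn(2, 5, 256), torch.randn(2, 256))
    print(heat.shape, scores.shape)  # torch.Size([2, 64, 64]) torch.Size([2, 5])
```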