ARTFEED — Contemporary Art Intelligence

Text-Guided Dual-Gaze Prediction for Object-Level Driver Attention

ai-technology · 2026-04-24

Researchers have introduced a new framework for predicting where autonomous vehicles should focus attention at the level of individual objects. To support it, they constructed the G-W3DA dataset by combining a multimodal large language model with the Segment Anything Model 3 (SAM3), cross-validating the two to decompose scene-level gaze heatmaps into object-level masks. This addresses a limitation of existing datasets, which typically provide only global, scene-level gaze annotations; such annotations can lead to text-vision decoupling and visual biases in Vision-Language Models (VLMs). The study details the full pipeline, from data construction to model design.
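The paper itself is not quoted here, but the described decomposition step can be sketched in miniature: given a scene-level gaze heatmap and candidate object masks (as SAM3 might produce), keep only objects that both attract enough gaze mass and whose label is confirmed by the language model. The function name, threshold, and cross-check rule below are illustrative assumptions, not the authors' actual procedure.

```python
import numpy as np

def decompose_heatmap(heatmap, masks, labels, mllm_labels, overlap_thresh=0.2):
    """Split a scene-level gaze heatmap into per-object gaze maps.

    An object survives only if (a) its mask captures at least
    `overlap_thresh` of the total gaze mass, and (b) its label is
    cross-validated by the set of labels the (hypothetical) MLLM
    proposed for the scene.
    """
    total = heatmap.sum()
    object_gaze = {}
    for mask, label in zip(masks, labels):
        mass = heatmap[mask].sum() / total if total > 0 else 0.0
        if mass >= overlap_thresh and label in mllm_labels:
            # Restrict the heatmap to this object's mask.
            object_gaze[label] = heatmap * mask.astype(float)
    return object_gaze
```

With a toy 4x4 heatmap concentrated on one object, only that object's mask would be retained; distant objects fall below the threshold even if the MLLM names them.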

Key facts

  • arXiv:2604.20191v1
  • Published on arXiv
  • Proposes dual-branch gaze prediction framework
  • Constructs G-W3DA dataset
  • Uses multimodal large language model and SAM3
  • Decouples heatmaps into object-level masks
  • Addresses scene-level gaze limitations
  • Targets human-like autonomous driving
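The "dual-branch" idea listed above can be illustrated with a minimal, made-up head: one branch emits a global heatmap over a spatial grid, the other scores candidate objects, and a text embedding gates the shared image features. All dimensions, weight shapes, and the gating scheme are assumptions for illustration; the paper's actual architecture is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

class DualGazeSketch:
    """Hypothetical two-branch gaze head (random, untrained weights)."""

    def __init__(self, feat_dim, txt_dim, grid_cells, n_objects):
        self.W_txt = rng.normal(size=(txt_dim, feat_dim))      # text -> gate
        self.W_scene = rng.normal(size=(feat_dim, grid_cells)) # scene branch
        self.W_obj = rng.normal(size=(feat_dim, n_objects))    # object branch

    def forward(self, img_feat, txt_feat):
        # Text-guided gating of the shared image features.
        gate = 1.0 / (1.0 + np.exp(-(txt_feat @ self.W_txt)))
        f = img_feat * gate
        # Scene branch: softmax heatmap over the spatial grid.
        logits = f @ self.W_scene
        heat = np.exp(logits - logits.max())
        heat /= heat.sum()
        # Object branch: independent per-object attention scores.
        obj_scores = 1.0 / (1.0 + np.exp(-(f @ self.W_obj)))
        return heat, obj_scores
```

The two outputs mirror the dataset's two annotation levels: the heatmap corresponds to scene-level gaze, the object scores to the decomposed object-level masks.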

Entities

Institutions

  • arXiv
