Text-Guided Dual-Gaze Prediction for Object-Level Driver Attention
Researchers have introduced a new framework for predicting object-level driver attention in autonomous driving. To support it, they construct the G-W3DA dataset by combining a multimodal large language model with the Segment Anything Model 3 (SAM3), decomposing scene-level gaze heatmaps into object-level masks under rigorous cross-validation. This addresses a limitation of existing datasets, which typically provide only global, scene-level gaze and can therefore induce text-vision decoupling and visual bias in Vision-Language Models (VLMs). The study covers the full pipeline, from data construction to model design.
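The decoupling step can be pictured with a short sketch: given a scene-level gaze heatmap and a set of segmentation masks (for example, SAM-style masks prompted on text from a multimodal LLM), each object's attention is the share of total gaze mass falling inside its mask. The function name, threshold, and normalization below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def decouple_heatmap(heatmap: np.ndarray, masks: list, min_score: float = 0.05):
    """Split a scene-level gaze heatmap into object-level attention scores.

    heatmap: (H, W) float array, e.g. a recorded or predicted gaze density map.
    masks:   list of (H, W) boolean arrays, e.g. from a SAM-style segmentation
             model prompted on text from a multimodal LLM (an assumption here,
             not the paper's exact pipeline).
    Returns (mask, score) pairs for objects whose share of the total gaze
    mass exceeds `min_score`, sorted with the most-attended objects first.
    """
    total = heatmap.sum() + 1e-8
    object_attention = []
    for mask in masks:
        score = float(heatmap[mask].sum() / total)  # fraction of gaze mass on this object
        if score >= min_score:
            object_attention.append((mask, score))
    object_attention.sort(key=lambda pair: pair[1], reverse=True)
    return object_attention

if __name__ == "__main__":
    # Toy example: one gaze blob overlapping the first of two masks.
    H, W = 64, 64
    yy, xx = np.mgrid[0:H, 0:W]
    heatmap = np.exp(-((yy - 20) ** 2 + (xx - 20) ** 2) / (2 * 5.0 ** 2))
    mask_a = (yy < 32) & (xx < 32)    # covers the blob
    mask_b = (yy >= 32) & (xx >= 32)  # misses the blob
    for i, (mask, score) in enumerate(decouple_heatmap(heatmap, [mask_a, mask_b])):
        print(f"object {i}: gaze share = {score:.2f}")
```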
Key facts
- arXiv:2604.20191v1
- Published on arXiv
- Proposes dual-branch gaze prediction framework (see the sketch after this list)
- Constructs G-W3DA dataset
- Uses multimodal large language model and SAM3
- Decouples heatmaps into object-level masks
- Addresses scene-level gaze limitations
- Targets human-like autonomous driving
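
As a minimal sketch of what such a dual-branch model could look like, the code below pairs a scene-level branch that decodes a global gaze heatmap with an object-level branch that scores candidate objects conditioned on text features. The module names, dimensions, and fusion scheme are assumptions for illustration only, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class DualGazePredictor(nn.Module):
    """Illustrative dual-branch gaze model (assumed structure, not the paper's).

    - Scene branch: decodes pooled image features into a global gaze heatmap.
    - Object branch: fuses per-object features with a text embedding (e.g. from
      a multimodal LLM) and scores each object's gaze likelihood.
    """

    def __init__(self, img_dim: int = 256, txt_dim: int = 256, heat_size: int = 64):
        super().__init__()
        self.heat_size = heat_size
        # Scene-level branch: global heatmap from pooled image features.
        self.scene_head = nn.Sequential(
            nn.Linear(img_dim, 512), nn.ReLU(),
            nn.Linear(512, heat_size * heat_size),
        )
        # Object-level branch: text-conditioned scoring of per-object features.
        self.fuse = nn.Linear(img_dim + txt_dim, 256)
        self.object_head = nn.Sequential(nn.ReLU(), nn.Linear(256, 1))

    def forward(self, img_feat, obj_feats, txt_feat):
        # img_feat:  (B, img_dim)     pooled image features
        # obj_feats: (B, N, img_dim)  per-object features (e.g. mask-pooled)
        # txt_feat:  (B, txt_dim)     text/instruction embedding
        B, N, _ = obj_feats.shape
        heatmap = self.scene_head(img_feat).view(B, self.heat_size, self.heat_size)
        txt = txt_feat.unsqueeze(1).expand(B, N, -1)      # broadcast text to objects
        fused = self.fuse(torch.cat([obj_feats, txt], dim=-1))
        obj_logits = self.object_head(fused).squeeze(-1)  # (B, N) gaze score per object
        return heatmap, obj_logits

if __name__ == "__main__":
    model = DualGazePredictor()
    heat, scores = model(torch.randn(2, 256), torch.randn(2, 5, 256), torch.randn(2, 256))
    print(heat.shape, scores.shape)  # torch.Size([2, 64, 64]) torch.Size([2, 5])
```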