Glance-or-Gaze: AI Framework for Adaptive Visual Search
Researchers have introduced Glance-or-Gaze (GoG), a fully autonomous framework that moves large multimodal models (LMMs) from passive perception to active visual planning. Its Selective Gaze mechanism adaptively decides whether to glance at the global context or gaze at high-value regions, filtering out irrelevant information before retrieval. To improve performance on complex visual queries, the authors developed a dual-stage training strategy called Reflective GoG Behavior Alignment. The work addresses two limitations: the static parametric knowledge of LMMs and the indiscriminate whole-image retrieval of search-augmented methods. The paper is available on arXiv under ID 2601.13942.
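The glance-versus-gaze decision can be pictured with a small sketch. This is only an illustration under stated assumptions, not the paper's implementation: the names `Region`, `ViewPlan`, and `plan_view`, the per-region relevance scores, and the `gaze_threshold` and `max_regions` parameters are all hypothetical. In GoG the decision is made by the trained model itself, not by a fixed threshold.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Region:
    box: Box
    relevance: float  # hypothetical query-relevance score in [0, 1]


@dataclass
class ViewPlan:
    mode: str         # "glance" = whole image, "gaze" = selected regions
    boxes: List[Box]  # areas that would be passed on to retrieval


def plan_view(
    image_size: Tuple[int, int],
    regions: List[Region],
    gaze_threshold: float = 0.6,  # assumed cutoff, not taken from the paper
    max_regions: int = 2,         # assumed cap, not taken from the paper
) -> ViewPlan:
    """Choose between the global context and a few high-value regions."""
    width, height = image_size
    # Keep only regions that score at or above the cutoff, best first.
    salient = sorted(
        (r for r in regions if r.relevance >= gaze_threshold),
        key=lambda r: r.relevance,
        reverse=True,
    )
    if not salient:
        # Nothing stands out: fall back to the whole image (glance).
        return ViewPlan("glance", [(0, 0, width, height)])
    # Otherwise focus on the top regions and drop the rest before retrieval (gaze).
    return ViewPlan("gaze", [r.box for r in salient[:max_regions]])


if __name__ == "__main__":
    candidates = [
        Region((40, 60, 200, 220), 0.91),
        Region((300, 10, 620, 120), 0.15),
    ]
    plan = plan_view((640, 480), candidates)
    print(plan.mode, plan.boxes)  # gaze [(40, 60, 200, 220)]
```

The point of the sketch is the ordering: low-value regions are discarded before any retrieval call is made, so the search step never sees them.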
Key facts
- GoG is a fully autonomous framework for large multimodal models.
- It introduces a Selective Gaze mechanism for adaptive visual focus.
- The framework shifts from passive perception to active visual planning.
- It filters irrelevant information before retrieval.
- A dual-stage training strategy called Reflective GoG Behavior Alignment is used.
- The paper addresses the limitations of static parametric knowledge in large multimodal models (LMMs).
- It also targets the indiscriminate whole-image retrieval used by search-augmented methods.
- The paper is available on arXiv with ID 2601.13942.
Entities
Institutions
- arXiv