GLANCE: Curiosity-Driven Exploration for VLM Agents

ai-technology · 2026-05-07

GLANCE is a new framework that connects reasoning and exploration in vision-language model (VLM) agents by using the discrepancy between the agent's linguistic predictions and the visual reality it then observes as an intrinsic curiosity signal. This addresses the problem of passive reasoning in sparse-reward tasks, so that agents actively seek out new information instead of waiting for reward. GLANCE grounds the agent's linguistic world model in the stable visual representations of an evolving target network and uses reinforcement learning to drive exploration. The paper is available on arXiv under the identifier 2605.03782.

Key facts

GLANCE is a framework for VLM agents that uses curiosity-driven exploration.
It bridges reasoning and exploration by grounding linguistic world models into visual representations.
The curiosity signal is based on the discrepancy between linguistic prediction and visual reality.
It addresses sparse-reward tasks in partially observable visual environments.
The framework uses reinforcement learning for exploration.
Published on arXiv with identifier 2605.03782.
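
To make the curiosity signal described above concrete, here is a minimal sketch of how a prediction-versus-observation gap could be turned into an intrinsic reward. It is an illustration only, not the method from the paper: the cosine distance, the exponential-moving-average update for the target network, the mixing weight `beta`, and all function names are assumptions introduced here, since this summary does not specify them.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally surprising
    return 1.0 - dot / (norm_a * norm_b)


def curiosity_reward(predicted_embedding, observed_embedding):
    """Hypothetical intrinsic reward: the gap between what the agent's
    linguistic world model predicted and what the target network encodes
    from the actual next observation."""
    return cosine_distance(predicted_embedding, observed_embedding)


def ema_update(target_weights, online_weights, decay=0.99):
    """Assumed update rule: move the target network slowly toward the
    online network so that its representations stay stable while evolving."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_weights, online_weights)]


def total_reward(extrinsic, intrinsic, beta=0.1):
    """Assumed mixing: sparse task reward plus a weighted curiosity bonus."""
    return extrinsic + beta * intrinsic
```

In this sketch, a prediction that matches the observed embedding earns no bonus, while a mismatched one earns a positive bonus even when the task reward is zero, which is what lets a reinforcement learning agent keep exploring in a sparse-reward environment.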
