ARTFEED — Contemporary Art Intelligence

GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

ai-technology · 2026-05-11

A new AI architecture called GazeVLM introduces active vision into Vision-Language Models by letting the model internally control its attention through generated gaze tokens. Unlike traditional VLMs, which passively process all visual tokens, GazeVLM dynamically directs focus toward task-relevant details while suppressing irrelevant information, mimicking the metacognitive oversight of human vision. The top-down attention mechanism is embedded directly in the reasoning loop: the model autonomously generates gaze tokens that modify its causal attention mask. The approach aims to reduce linguistic hallucinations and improve spatial reasoning by avoiding the attention dilution that large visual token contexts cause. The paper is published on arXiv under ID 2605.07817.
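To make the mechanism concrete, here is a minimal sketch of how a gaze-conditioned causal attention mask could be built. It is a sketch under assumptions, not the paper's implementation: the PyTorch additive-mask interface, the function name build_gaze_mask, and the suppression_bias value are all illustrative.

    # Minimal sketch (assumed interface, not the paper's code): start from a
    # standard causal mask, then soft-suppress visual tokens outside the gaze.
    import torch

    def build_gaze_mask(seq_len: int,
                        visual_positions: torch.Tensor,
                        focused_positions: torch.Tensor,
                        suppression_bias: float = -4.0) -> torch.Tensor:
        """Additive attention mask of shape (seq_len, seq_len)."""
        # Causal part: -inf above the diagonal, 0 on and below it.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")),
                          diagonal=1)

        # Mark which key positions are visual and which are in focus.
        is_visual = torch.zeros(seq_len, dtype=torch.bool)
        is_visual[visual_positions] = True
        in_focus = torch.zeros(seq_len, dtype=torch.bool)
        in_focus[focused_positions] = True

        # Peripheral visual tokens are dampened, not hard-masked.
        peripheral = is_visual & ~in_focus
        mask[:, peripheral] += suppression_bias
        return mask

    # Toy usage: 16-token sequence, positions 0-7 are visual, gaze on 2-4.
    attn_mask = build_gaze_mask(16, torch.arange(0, 8), torch.tensor([2, 3, 4]))
    # attn_mask is added to the attention logits before softmax.

Using a finite negative bias rather than -inf mirrors the peripheral awareness of human active vision noted in the key facts below: suppressed tokens are dampened but remain reachable if a query matches them strongly.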

Key facts

  • GazeVLM is a multimodal architecture for Vision-Language Models.
  • It internalizes metacognitive control over attention resources into the reasoning loop.
  • The model generates gaze tokens to establish top-down control over its causal attention mask.
  • Gaze tokens dynamically signal focal intent and trigger a suppression bias that dampens irrelevant visual tokens rather than discarding them (see the decoding-loop sketch after this list).
  • Human active vision involves top-down goal-directed attention with peripheral awareness.
  • Traditional VLMs process visual information passively via static token accumulation.
  • The approach aims to reduce linguistic hallucinations and improve spatial reasoning.
  • The paper is available on arXiv under ID 2605.07817.
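Because gaze tokens are generated inside the reasoning loop, the mask above would be rebuilt whenever one is emitted. The loop below sketches that control flow, reusing build_gaze_mask from the earlier snippet; the special-token id, the region table, and the step callable are hypothetical stand-ins for a real decoder, not the paper's interface.

    import torch

    GAZE_TOKEN_ID = 32_000                   # hypothetical special-token id
    REGIONS = {0: torch.arange(0, 4),        # hypothetical map from region
               1: torch.arange(4, 8)}        # id to visual token positions

    def generate_with_gaze(step, tokens, visual_positions, max_new=32):
        """Decoding-loop sketch; step(tokens, mask) -> next token id."""
        focus = visual_positions             # start with the whole image
        for _ in range(max_new):
            mask = build_gaze_mask(len(tokens), visual_positions, focus)
            next_id = step(tokens, mask)
            tokens = tokens + [next_id]
            if next_id == GAZE_TOKEN_ID:
                # The model redirected its gaze: decode one more token as
                # a region pointer, then refocus for subsequent steps.
                mask = build_gaze_mask(len(tokens), visual_positions, focus)
                region = step(tokens, mask) % len(REGIONS)
                tokens = tokens + [region]
                focus = REGIONS[region]
        return tokens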

Entities

Institutions

  • arXiv

Sources

  • arXiv:2605.07817 — https://arxiv.org/abs/2605.07817