Visual-Anchored Thinking via Reasoning-Prefix Masking in VLM Distillation
A new distillation framework for vision-language models (VLMs) strengthens student models' reliance on visual evidence by masking their most salient reasoning prefixes during training. The approach, detailed in arXiv:2605.11651, targets compact think-answer VLMs such as Qwen3-VL-Thinking, which generate intermediate reasoning steps but incur high computational cost. The method combines token-wise salient reasoning-prefix masking with self-paced masking strategies to encourage visual anchoring.
Key facts
- arXiv:2605.11651 introduces a think-answer distillation framework
- Framework masks student's salient reasoning prefixes to encourage visual evidence reliance
- Targets compact VLMs like Qwen3-VL-Thinking
- Includes token-wise salient reasoning-prefix masking
- Includes self-paced masking strategies
- Aims to reduce computational cost of think-answer VLMs
- Published on arXiv
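The masking idea above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the linear self-paced schedule, and the use of precomputed per-token saliency scores are all illustrative assumptions. The sketch masks the most salient tokens in a reasoning prefix, with the masking ratio ramping up over training so the student is gradually pushed to rely on visual evidence instead of memorized reasoning text.

```python
# Hypothetical sketch of token-wise salient reasoning-prefix masking
# with a self-paced schedule. Names and heuristics are assumptions,
# not the method described in arXiv:2605.11651.

def self_paced_mask_ratio(step, total_steps, start=0.1, end=0.5):
    """Linearly raise the fraction of masked reasoning tokens as training
    progresses (an assumed schedule; the paper may use a different curve)."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac

def mask_salient_prefix(reasoning_tokens, saliency, ratio):
    """Replace the `ratio` most salient tokens of the reasoning prefix with a
    mask token, so the student cannot lean on them when producing its answer."""
    k = int(len(reasoning_tokens) * ratio)
    # Indices of the k highest-saliency tokens.
    top = set(sorted(range(len(saliency)), key=lambda i: -saliency[i])[:k])
    return ["<mask>" if i in top else t for i, t in enumerate(reasoning_tokens)]

# Example: at the midpoint of training, mask the top half of a short prefix.
ratio = self_paced_mask_ratio(step=50, total_steps=100)
masked = mask_salient_prefix(
    ["the", "cat", "is", "red"], [0.1, 0.9, 0.2, 0.8], ratio=0.5
)
```

In a real training loop, the masked positions would typically be excluded from the distillation loss (or replaced in the input), which is what encourages the student to anchor its answer in the image rather than in the teacher's reasoning text.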