Visual-Anchored Thinking via Reasoning-Prefix Masking in VLM Distillation
A new distillation framework for vision-language models (VLMs) strengthens student models' reliance on visual evidence by masking their most salient reasoning prefixes during training. The approach, detailed in arXiv:2605.11651, targets compact think-answer VLMs such as Qwen3-VL-Thinking, which generate intermediate reasoning steps but incur high computational cost. The method combines token-wise salient reasoning-prefix masking with self-paced masking strategies to encourage visual anchoring.
Key facts
- arXiv:2605.11651 introduces a think-answer distillation framework
- Framework masks student's salient reasoning prefixes to encourage visual evidence reliance
- Targets compact VLMs like Qwen3-VL-Thinking
- Includes token-wise salient reasoning-prefix masking
- Includes self-paced masking strategies
- Aims to reduce computational cost of think-answer VLMs
- Published on arXiv
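The masking idea above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the linear self-paced schedule, and the use of precomputed per-token saliency scores are all illustrative assumptions. The sketch masks the most salient tokens in a reasoning prefix, with the masking ratio ramping up over training so the student is gradually pushed to rely on visual evidence instead of memorized reasoning text.

```python
# Hypothetical sketch of token-wise salient reasoning-prefix masking
# with a self-paced schedule. Names and heuristics are assumptions,
# not the method described in arXiv:2605.11651.

def self_paced_mask_ratio(step, total_steps, start=0.1, end=0.5):
    """Linearly raise the fraction of masked reasoning tokens as training
    progresses (an assumed schedule; the paper may use a different curve)."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac

def mask_salient_prefix(reasoning_tokens, saliency, ratio):
    """Replace the `ratio` most salient tokens of the reasoning prefix with a
    mask token, so the student cannot lean on them when producing its answer."""
    k = int(len(reasoning_tokens) * ratio)
    # Indices of the k highest-saliency tokens.
    top = set(sorted(range(len(saliency)), key=lambda i: -saliency[i])[:k])
    return ["<mask>" if i in top else t for i, t in enumerate(reasoning_tokens)]

# Example: at the midpoint of training, mask the top half of a short prefix.
ratio = self_paced_mask_ratio(step=50, total_steps=100)
masked = mask_salient_prefix(
    ["the", "cat", "is", "red"], [0.1, 0.9, 0.2, 0.8], ratio=0.5
)
```

In a real training loop, the masked positions would typically be excluded from the distillation loss (or replaced in the input), which is what encourages the student to anchor its answer in the image rather than in the teacher's reasoning text.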