Self-Captioning Method Boosts Vision-Language Model Robustness
A new arXiv paper (2605.08145) proposes a self-captioning workflow to improve the robustness of vision-language models against hallucination and corrupted modalities. The approach amplifies redundant multimodal interactions (information shared between vision and language) so that an intact modality can compensate for an impaired one. A Multimodal Interaction Gate converts unique interactions into redundant ones, increasing the amount of exploitable shared information. The authors find that modern instruction datasets often eliminate these redundancies for visual grounding, a gap the method addresses: increasing redundancy reduces visually induced errors.
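The summary does not describe how the gate is implemented. As a rough illustration only, the PyTorch sketch below shows one plausible form such a mechanism could take: a learned sigmoid gate that blends each modality's features toward a jointly estimated shared representation. The class name, projection scheme, and dimensions here are assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of a multimodal interaction gate. All names and the
# projection scheme are assumptions; the paper's actual design may differ.
import torch
import torch.nn as nn

class MultimodalInteractionGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Learned gate deciding how much unique signal to convert to shared signal.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Projection into a shared space where vision and language overlap.
        self.shared_proj = nn.Linear(2 * dim, dim)

    def forward(self, vision: torch.Tensor, language: torch.Tensor) -> torch.Tensor:
        # vision, language: (batch, dim) pooled features from each encoder.
        joint = torch.cat([vision, language], dim=-1)
        shared = self.shared_proj(joint)  # estimate of redundant information
        g = self.gate(joint)              # per-dimension mixing weight in [0, 1]
        # Blend each modality toward the shared estimate; if one modality is
        # corrupted, the shared component lets the other compensate.
        vision_out = g * shared + (1 - g) * vision
        language_out = g * shared + (1 - g) * language
        return torch.cat([vision_out, language_out], dim=-1)

# Example: fuse 512-dim vision and language features for a downstream head.
gate = MultimodalInteractionGate(dim=512)
v = torch.randn(4, 512)
t = torch.randn(4, 512)
fused = gate(v, t)
print(fused.shape)  # torch.Size([4, 1024])
```

Blending toward a shared estimate means that when one input is degraded, the fused representation still carries information recoverable from the other modality, which is the intuition behind exploiting redundancy for robustness.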
Key facts
- arXiv paper ID: 2605.08145
- Addresses hallucination and robustness in vision-language models
- Exploits shared information between modalities
- Introduces Multimodal Interaction Gate
- Converts unique interactions into redundant interactions
- Modern instruction datasets reduce redundancies
- Increasing redundancy reduces visually induced errors