Vision-Language Models Face Trustworthiness Crisis
A new arXiv paper challenges the reliability of Vision-Language Models (VLMs), arguing that current models suffer from "functional blindness": they answer from language priors rather than from grounded visual understanding. The authors propose a Modality Translation Protocol to quantify the problem.
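The summary does not spell out how the Modality Translation Protocol works. As a rough illustration of one common way to quantify reliance on language priors (a generic diagnostic, not necessarily the authors' method), the sketch below compares a model's accuracy with and without its visual input; the toy dataset, the `prior_only_model` stub, and all function names are hypothetical.

```python
"""Minimal sketch: quantifying language-prior reliance in a VLM.

Idea: evaluate the same question set twice, once with the image and
once with the image ablated. If accuracy barely drops, the model is
answering from language priors rather than from the pixels.
"""

from collections.abc import Callable

# Toy VQA-style items: (image_id, question, ground-truth answer).
ITEMS = [
    ("img_0", "What color is the stop sign?", "red"),
    ("img_1", "How many dogs are in the picture?", "2"),
    ("img_2", "Is the person wearing a hat?", "no"),
]

def accuracy(answer_fn: Callable[[str, str | None], str]) -> float:
    """Fraction of items the model answers correctly."""
    correct = sum(answer_fn(q, img) == gold for img, q, gold in ITEMS)
    return correct / len(ITEMS)

def blindness_gap(model: Callable[[str, str | None], str]) -> float:
    """Accuracy with images minus accuracy with images ablated.

    A gap near zero means visual input contributes little to the
    model's answers -- the 'functional blindness' symptom.
    """
    acc_full = accuracy(lambda q, img: model(q, img))
    acc_blind = accuracy(lambda q, _img: model(q, None))
    return acc_full - acc_blind

# Hypothetical stub standing in for a real VLM call. It ignores the
# image entirely and answers from dataset statistics, so the gap
# below comes out as 0.0 -- pure language-prior behavior.
PRIORS = {
    "What color is the stop sign?": "red",
    "How many dogs are in the picture?": "2",
    "Is the person wearing a hat?": "yes",
}

def prior_only_model(question: str, image_id: str | None) -> str:
    return PRIORS.get(question, "unknown")

if __name__ == "__main__":
    print(f"blindness gap: {blindness_gap(prior_only_model):+.2f}")
```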
Key facts
- Paper title: The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm
- Published on arXiv with ID 2604.20665
- Critiques the Vision Encoder-Projector-LLM paradigm
- Claims VLMs exhibit functional blindness
- Proposes Modality Translation Protocol as a solution
- Argues current evaluation methods conflate dataset biases with architectural incapacity
- Takes an information-theoretic approach to measuring how much visual information actually reaches the model's answers (see the sketch after this list)
- Focuses on trustworthiness of multimodal reasoning
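The information-theoretic angle is only named, not detailed, in this summary. As a hedged sketch of how such an analysis might look (an illustration under assumed definitions, not the paper's formulation), one can treat the model as a channel from image X to answer Y and estimate the mutual information I(X; Y) = H(Y) - H(Y | X): if the answer distribution barely varies across images, I(X; Y) is near zero and the visual modality carries almost no information to the output. The toy samples below are hypothetical.

```python
import math
from collections import Counter

def normalize(counts: Counter) -> dict:
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def mutual_information(samples: list[tuple[str, str]]) -> float:
    """Estimate I(X; Y) in bits from (image_id, answer) samples."""
    joint = Counter(samples)
    n = len(samples)
    p_x = normalize(Counter(x for x, _ in samples))
    p_y = normalize(Counter(y for _, y in samples))
    return sum(
        (c / n) * math.log2((c / n) / (p_x[x] * p_y[y]))
        for (x, y), c in joint.items()
    )

# A 'blind' model gives the same answer regardless of which image it
# sees, so I(X; Y) = 0 bits.
blind = [("img_a", "red"), ("img_b", "red"), ("img_c", "red")]

# A grounded model whose answer varies with the image carries bits.
grounded = [("img_a", "red"), ("img_b", "green"), ("img_c", "blue")]

print(f"blind model:    I(X;Y) = {mutual_information(blind):.2f} bits")
print(f"grounded model: I(X;Y) = {mutual_information(grounded):.2f} bits")
```

On the blind samples the estimate is 0.00 bits; on the grounded samples it is log2(3) ≈ 1.58 bits, the maximum possible for three equiprobable images.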