Frontier VLMs Fail Clinical Trust Tests in Medical VQA Audit
A recent study audits five frontier vision-language models (VLMs), Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, and Qwen 2.5 VL, on medical visual question answering (VQA), and uncovers significant weaknesses in both perception and perception-reasoning integration. The best model reaches only 0.23 mean IoU and 19.1% Acc@0.5 on anatomical and pathological localization, and its errors include clinically dangerous laterality confusion, i.e., placing findings on the wrong side of the body. A self-grounding pipeline, in which the same model first localizes the relevant region and then answers conditioned on its own predicted box, lowers VQA accuracy for every model because of inaccurate localization and instruction-compliance failures; parse failures reach 70%–99% for Gemini and GPT-5 on VQA-RAD. Substituting ground-truth annotations for the predicted boxes recovers and improves VQA accuracy, which points to localization quality, not the two-stage format itself, as the bottleneck. The results underscore the need for auditable grounding in clinical AI applications.
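For context on the headline numbers, here is a minimal sketch of how mean IoU and Acc@0.5 are conventionally computed for box localization. It assumes axis-aligned boxes in (x1, y1, x2, y2) coordinates; the function names are illustrative, not the paper's actual evaluation code.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def localization_metrics(preds: Sequence[Box], gts: Sequence[Box],
                         thresh: float = 0.5) -> Tuple[float, float]:
    """Return (mean IoU, Acc@thresh). Acc@0.5 is the fraction of
    predictions whose IoU with the ground-truth box is >= 0.5."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    mean_iou = sum(ious) / len(ious)
    acc = sum(v >= thresh for v in ious) / len(ious)
    return mean_iou, acc
```

Read against these definitions, a mean IoU of 0.23 means the best model's predicted boxes overlap the ground truth by less than a quarter of their combined area on average, and 19.1% Acc@0.5 means fewer than one in five boxes clears the standard 0.5 overlap bar.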
Key facts
- Five frontier VLMs audited: Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, Qwen 2.5 VL.
- Best model achieves only 0.23 mean IoU and 19.1% Acc@0.5 for localization.
- Self-grounding pipeline degrades VQA accuracy for all models (see the pipeline sketch after this list).
- Parse failure rates reach 70%–99% for Gemini and GPT-5 on VQA-RAD.
- Ground-truth annotations recover and improve VQA accuracy.
- Clinically dangerous laterality confusion observed.
- Study focuses on trustworthiness in medical VQA.
- Published on arXiv (2604.27720).
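To make the self-grounding failure mode concrete, the sketch below shows a plausible two-stage harness. It is a hypothetical reconstruction, not the authors' code: `query_model` stands in for whatever VLM API is under audit, and the JSON box format is an assumption.

```python
import json
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]

def parse_box(raw: str) -> Optional[Box]:
    """Try to read a predicted box from the model's text reply.
    Returns None on a parse failure (malformed or non-compliant output)."""
    try:
        box = tuple(float(v) for v in json.loads(raw)["box"])
        return box if len(box) == 4 else None
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None

def self_grounded_vqa(query_model: Callable[[str, str], str],
                      image_id: str, question: str) -> dict:
    """Stage 1: ask the model to localize the region relevant to the
    question; stage 2: ask it to answer conditioned on its own box.
    `query_model(image_id, prompt) -> str` is a hypothetical VLM API."""
    loc_prompt = (f"Return JSON {{\"box\": [x1, y1, x2, y2]}} for the "
                  f"region relevant to: {question}")
    box = parse_box(query_model(image_id, loc_prompt))
    if box is None:
        # Counted as a parse failure; on VQA-RAD the study reports
        # 70%-99% of Gemini / GPT-5 replies fell into this bucket.
        return {"answer": None, "parse_failure": True}
    ans_prompt = f"Focusing on the region {list(box)}, answer: {question}"
    return {"answer": query_model(image_id, ans_prompt),
            "parse_failure": False}
```

Swapping the predicted box for a ground-truth annotation at stage 2 gives the oracle variant the study uses to show that grounding accuracy, rather than the two-stage format, drives the accuracy drop.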