Frontier VLMs Fail Clinical Trust Tests in Medical VQA Audit
A recent study audits five frontier vision-language models (VLMs), Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, and Qwen 2.5 VL, on medical visual question answering (VQA), and uncovers significant weaknesses in both perception and perception-reasoning integration. The best model reaches only 0.23 mean IoU and 19.1% Acc@0.5 on anatomical and pathological localization, and its errors include clinically dangerous laterality confusion, i.e., placing findings on the wrong side of the body. A self-grounding pipeline, in which the same model first localizes the relevant region and then answers conditioned on its own predicted box, lowers VQA accuracy for every model because of inaccurate localization and instruction-compliance failures; parse failures reach 70%–99% for Gemini and GPT-5 on VQA-RAD. Substituting ground-truth annotations for the predicted boxes recovers and improves VQA accuracy, which points to localization quality, not the two-stage format itself, as the bottleneck. The results underscore the need for auditable grounding in clinical AI applications.
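For context on the headline numbers, here is a minimal sketch of how mean IoU and Acc@0.5 are conventionally computed for box localization. It assumes axis-aligned boxes in (x1, y1, x2, y2) coordinates; the function names are illustrative, not the paper's actual evaluation code.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def localization_metrics(preds: Sequence[Box], gts: Sequence[Box],
                         thresh: float = 0.5) -> Tuple[float, float]:
    """Return (mean IoU, Acc@thresh). Acc@0.5 is the fraction of
    predictions whose IoU with the ground-truth box is >= 0.5."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    mean_iou = sum(ious) / len(ious)
    acc = sum(v >= thresh for v in ious) / len(ious)
    return mean_iou, acc
```

Read against these definitions, a mean IoU of 0.23 means the best model's predicted boxes overlap the ground truth by less than a quarter of their combined area on average, and 19.1% Acc@0.5 means fewer than one in five boxes clears the standard 0.5 overlap bar.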
Key facts
- Five frontier VLMs audited: Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, Qwen 2.5 VL.
- Best model achieves only 0.23 mean IoU and 19.1% Acc@0.5 for localization.
- Self-grounding pipeline degrades VQA accuracy for all models (see the pipeline sketch after this list).
- Parse failure rates reach 70%–99% for Gemini and GPT-5 on VQA-RAD.
- Ground-truth annotations recover and improve VQA accuracy.
- Clinically dangerous laterality confusion observed.
- Study focuses on trustworthiness in medical VQA.
- Published on arXiv (2604.27720).
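To make the self-grounding failure mode concrete, the sketch below shows a plausible two-stage harness. It is a hypothetical reconstruction, not the authors' code: `query_model` stands in for whatever VLM API is under audit, and the JSON box format is an assumption.

```python
import json
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]

def parse_box(raw: str) -> Optional[Box]:
    """Try to read a predicted box from the model's text reply.
    Returns None on a parse failure (malformed or non-compliant output)."""
    try:
        box = tuple(float(v) for v in json.loads(raw)["box"])
        return box if len(box) == 4 else None
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None

def self_grounded_vqa(query_model: Callable[[str, str], str],
                      image_id: str, question: str) -> dict:
    """Stage 1: ask the model to localize the region relevant to the
    question; stage 2: ask it to answer conditioned on its own box.
    `query_model(image_id, prompt) -> str` is a hypothetical VLM API."""
    loc_prompt = (f"Return JSON {{\"box\": [x1, y1, x2, y2]}} for the "
                  f"region relevant to: {question}")
    box = parse_box(query_model(image_id, loc_prompt))
    if box is None:
        # Counted as a parse failure; on VQA-RAD the study reports
        # 70%-99% of Gemini / GPT-5 replies fell into this bucket.
        return {"answer": None, "parse_failure": True}
    ans_prompt = f"Focusing on the region {list(box)}, answer: {question}"
    return {"answer": query_model(image_id, ans_prompt),
            "parse_failure": False}
```

Swapping the predicted box for a ground-truth annotation at stage 2 gives the oracle variant the study uses to show that grounding accuracy, rather than the two-stage format, drives the accuracy drop.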