Multimodal LLMs Fall Short in Real-World Dermatology
A study evaluating multimodal large language models (MLLMs) on real-world dermatology tasks reveals a significant performance drop compared to public benchmarks. Researchers tested four open-weight models (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct) and GPT-4.1 across three public datasets and a multi-site hospital cohort of 5,811 cases with 46,405 images. On public benchmarks, GPT-4.1 achieved 42.25% top-3 diagnostic accuracy, while the best open-weight model reached 26.55%. In the real-world cohort, performance declined substantially. The study highlights the gap between benchmark success and clinical applicability, emphasizing the need for more rigorous evaluation before deploying MLLMs in dermatology.
Key facts
- Evaluated four open-weight MLLMs and GPT-4.1
- Used three public datasets and a hospital cohort of 5,811 cases
- Hospital cohort included 46,405 clinical images
- GPT-4.1 achieved 42.25% top-3 accuracy on public benchmarks
- Best open-weight model reached 26.55% top-3 accuracy
- Performance declined substantially in the real-world hospital cohort
- Tasks included differential diagnosis and severity-based triage
- Models tested: InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct, GPT-4.1
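The top-3 accuracy metric cited above counts a case as correct when the true diagnosis appears anywhere in the model's three highest-ranked differential diagnoses. A minimal sketch of that computation (an illustration with made-up labels, not the study's evaluation code):

```python
def top_k_accuracy(ranked_predictions, true_labels, k=3):
    """Fraction of cases where the true diagnosis appears among the
    model's top-k ranked differential diagnoses."""
    hits = sum(
        1 for preds, label in zip(ranked_predictions, true_labels)
        if label in preds[:k]
    )
    return hits / len(true_labels)

# Toy example with hypothetical diagnoses (not study data):
preds = [
    ["eczema", "psoriasis", "tinea"],
    ["melanoma", "nevus", "seborrheic keratosis"],
    ["acne", "rosacea", "perioral dermatitis"],
]
labels = ["psoriasis", "seborrheic keratosis", "verruca vulgaris"]
print(top_k_accuracy(preds, labels, k=3))  # 2 of 3 cases hit
```

Under this metric, a model gets no extra credit for ranking the correct diagnosis first, which is why top-3 figures run higher than top-1.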
Entities
Institutions
- arXiv