Multimodal LLMs Fall Short in Real-World Dermatology
A study evaluating multimodal large language models (MLLMs) on real-world dermatology tasks reveals a significant performance drop compared to public benchmarks. Researchers tested four open-weight models (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct) and GPT-4.1 across three public datasets and a multi-site hospital cohort of 5,811 cases with 46,405 images. On public benchmarks, GPT-4.1 achieved 42.25% top-3 diagnostic accuracy, while the best open-weight model reached 26.55%. In the real-world cohort, performance declined substantially. The study highlights the gap between benchmark success and clinical applicability, emphasizing the need for more rigorous evaluation before deploying MLLMs in dermatology.
Key facts
- Evaluated four open-weight MLLMs and GPT-4.1
- Used three public datasets and a hospital cohort of 5,811 cases
- Hospital cohort included 46,405 clinical images
- GPT-4.1 achieved 42.25% top-3 accuracy on public benchmarks
- Best open-weight model reached 26.55% top-3 accuracy
- Performance declined substantially in the real-world hospital cohort
- Tasks included differential diagnosis and severity-based triage
- Models tested: InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct, GPT-4.1
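The top-3 accuracy metric cited above counts a case as correct when the true diagnosis appears anywhere in the model's three highest-ranked differential diagnoses. A minimal sketch of that computation (an illustration with made-up labels, not the study's evaluation code):

```python
def top_k_accuracy(ranked_predictions, true_labels, k=3):
    """Fraction of cases where the true diagnosis appears among the
    model's top-k ranked differential diagnoses."""
    hits = sum(
        1 for preds, label in zip(ranked_predictions, true_labels)
        if label in preds[:k]
    )
    return hits / len(true_labels)

# Toy example with hypothetical diagnoses (not study data):
preds = [
    ["eczema", "psoriasis", "tinea"],
    ["melanoma", "nevus", "seborrheic keratosis"],
    ["acne", "rosacea", "perioral dermatitis"],
]
labels = ["psoriasis", "seborrheic keratosis", "verruca vulgaris"]
print(top_k_accuracy(preds, labels, k=3))  # 2 of 3 cases hit
```

Under this metric, a model gets no extra credit for ranking the correct diagnosis first, which is why top-3 figures run higher than top-1.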
Entities
Institutions
- arXiv