ARTFEED — Contemporary Art Intelligence

Multimodal LLMs Fall Short in Real-World Dermatology

ai-technology · 2026-05-07

A study evaluating multimodal large language models (MLLMs) on real-world dermatology tasks reveals a sharp performance drop relative to public benchmarks. Researchers tested four open-weight models (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct) and GPT-4.1 on three public datasets and a multi-site hospital cohort of 5,811 cases with 46,405 images. On the public benchmarks, GPT-4.1 achieved 42.25% top-3 diagnostic accuracy, while the best open-weight model reached 26.55%; in the real-world cohort, performance declined substantially for all models. The study highlights the gap between benchmark performance and clinical applicability and underscores the need for more rigorous evaluation before MLLMs are deployed in dermatology.
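Top-3 accuracy, the headline metric here, counts a case as correct when the true diagnosis appears anywhere in the model's three highest-ranked predictions. A minimal sketch of how such a metric is computed (the function, diagnoses, and numbers below are illustrative, not taken from the study):

```python
def top_k_accuracy(predictions, labels, k=3):
    """Fraction of cases whose true label appears among the model's
    top-k ranked predictions.

    predictions: list of ranked candidate-label lists, one per case
    labels: list of true labels, one per case
    """
    hits = sum(1 for ranked, true in zip(predictions, labels) if true in ranked[:k])
    return hits / len(labels)

# Hypothetical example: four cases with ranked differential diagnoses
preds = [
    ["psoriasis", "eczema", "tinea"],      # true label ranked 1st -> hit
    ["acne", "rosacea", "dermatitis"],     # true label ranked 3rd -> hit
    ["melanoma", "nevus", "keratosis"],    # true label absent     -> miss
    ["urticaria", "eczema", "psoriasis"],  # true label ranked 2nd -> hit
]
truth = ["psoriasis", "dermatitis", "lentigo", "eczema"]
print(top_k_accuracy(preds, truth))  # 0.75
```

Relaxing from top-1 to top-3 credits a model whose differential includes the correct diagnosis even when it is not ranked first, which is a common scoring choice for diagnostic tasks where clinicians review a short candidate list.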

Key facts

  • Evaluated four open-weight MLLMs and GPT-4.1
  • Used three public datasets and a hospital cohort of 5,811 cases
  • Hospital cohort included 46,405 clinical images
  • GPT-4.1 achieved 42.25% top-3 accuracy on public benchmarks
  • Best open-weight model reached 26.55% top-3 accuracy
  • Performance declined substantially in real-world cohort
  • Tasks included differential diagnosis and severity-based triage
  • Models tested: InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct, GPT-4.1

Entities

Institutions

  • arXiv

Sources