LLMs Show Diagnostic Accuracy Drop in Interactive Clinical Reasoning

ai-technology · 2026-05-23

A new study posted to arXiv (submitted May 2025) evaluates large language models on active evidence-seeking for clinical diagnosis, finding that multi-turn interactions reduce accuracy by 12.75% and evidence quality by 24.36% compared to static full-context benchmarks. The researchers built an OSCE-inspired standardized patient simulator and a controlled benchmark of 468 cases, evaluated across 15 models. Error analysis attributes the declines to premature diagnostic closure and inefficient questioning. The results suggest that static benchmarks overestimate LLM performance in interactive settings, motivating complementary interactive assessment for safer clinical decision support.

Key facts

Study introduces an OSCE-inspired standardized patient simulator for LLM evaluation.
Benchmark covers 468 cases, evaluated across 15 models.
Multi-turn evidence seeking reduces diagnostic accuracy by 12.75%.
Supporting-evidence quality drops by 24.36% relative to full-context evaluation.
Errors linked to premature diagnostic closure and inefficient questioning.
Static full-context benchmarks may overestimate performance in interactive settings.
Research submitted to arXiv in May 2025.
Study focuses on clinical decision support safety.
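
To make the contrast between full-context and multi-turn evaluation concrete, here is a minimal Python sketch. It is not the study's simulator or scoring method: the `Case` and `PatientSimulator` classes, the `evidence_recall` metric, and the example case are all hypothetical, invented for illustration. The sketch assumes a patient that reveals a finding only when asked about it, and a fixed turn budget standing in for premature diagnostic closure.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """Hypothetical clinical case; not a case from the study's benchmark."""
    chief_complaint: str
    findings: dict[str, str]  # question topic -> patient's answer
    diagnosis: str


class PatientSimulator:
    """Toy standardized patient: reveals a finding only when asked about it."""

    def __init__(self, case: Case) -> None:
        self.case = case

    def opening(self) -> str:
        return self.case.chief_complaint

    def ask(self, topic: str) -> str:
        # Questions that miss every finding yield no useful evidence.
        return self.case.findings.get(topic, "Nothing notable.")


def full_context(case: Case) -> dict[str, str]:
    """Static setting: every finding is handed over up front."""
    return {"chief_complaint": case.chief_complaint, **case.findings}


def interactive_context(case: Case, topics: list[str], max_turns: int) -> dict[str, str]:
    """Multi-turn setting: evidence is limited to what was asked within the budget."""
    sim = PatientSimulator(case)
    evidence = {"chief_complaint": sim.opening()}
    for topic in topics[:max_turns]:
        evidence[topic] = sim.ask(topic)
    return evidence


def evidence_recall(case: Case, evidence: dict[str, str]) -> float:
    """Fraction of the case's findings that ended up in the gathered evidence."""
    found = sum(1 for topic in case.findings if topic in evidence)
    return found / len(case.findings)


# Illustrative case and question order (assumptions, not study data).
case = Case(
    chief_complaint="Cough for several weeks",
    findings={
        "fever": "Low-grade fever most evenings",
        "cough_duration": "About five weeks",
        "travel": "Recent travel abroad",
    },
    diagnosis="(withheld in this sketch)",
)
topics = ["diet", "fever", "cough_duration", "travel"]  # first question is wasted

static_recall = evidence_recall(case, full_context(case))
early_stop_recall = evidence_recall(case, interactive_context(case, topics, max_turns=2))
patient_recall = evidence_recall(case, interactive_context(case, topics, max_turns=4))
print(static_recall, early_stop_recall, patient_recall)
```

In this toy setup, stopping after two turns leaves most findings uncollected because one turn went to an irrelevant question, which mirrors the two error types the study names: inefficient questioning and premature diagnostic closure. A real evaluation would replace the fixed question list with a model choosing its own questions.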
