ARTFEED — Contemporary Art Intelligence

Frontier LLMs Show High Rates of Premature Closure in Medical Tasks

ai-technology · 2026-05-16

A new study on arXiv (2605.15000) examines premature closure in frontier large language models (LLMs): the tendency to commit to an answer under uncertainty rather than ask for clarification or decline to act. The researchers evaluated five frontier models on MedQA (n=500) and AfriMed-QA (n=490) and, after excluding correct responses, found baseline false-action rates of 55-81% on MedQA and 53-82% on AfriMed-QA. In open-ended evaluations, the models also gave inappropriate answers on 30% of 861 HealthBench questions and on 78% of 191 physician-crafted adversarial queries. Safety-oriented prompting, however, reduced premature closure across models.

Key facts

  • Premature closure defined as inappropriate commitment under uncertainty in LLMs
  • Five frontier LLMs evaluated on MedQA, AfriMed-QA, HealthBench, and adversarial queries
  • Baseline false-action rates of 55-81% on MedQA and 53-82% on AfriMed-QA
  • Inappropriate answers on 30% of HealthBench questions and 78% of adversarial queries
  • Safety-oriented prompting reduced premature closure across models
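As a rough illustration of the headline metric, the sketch below shows one plausible way a "false-action rate" like those reported could be computed: among responses that answered incorrectly, the fraction where the model committed to an answer instead of deferring. The field names and the commit/defer labels are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of a false-action-rate calculation.
# Record fields ("correct", "action") are assumptions for illustration,
# not the paper's actual data schema.

def false_action_rate(responses):
    """Among incorrect responses, return the fraction where the model
    committed to an answer rather than deferring (clarifying/declining)."""
    incorrect = [r for r in responses if not r["correct"]]
    if not incorrect:
        return 0.0
    committed = [r for r in incorrect if r["action"] == "commit"]
    return len(committed) / len(incorrect)

sample = [
    {"correct": False, "action": "commit"},  # wrong answer, committed anyway
    {"correct": False, "action": "defer"},   # wrong, but deferred (safer)
    {"correct": True,  "action": "commit"},  # correct responses are excluded
    {"correct": False, "action": "commit"},
]
print(false_action_rate(sample))  # 2 of the 3 incorrect responses committed
```

Under this reading, the reported 55-81% baseline rates would mean that, on questions the models got wrong, they nonetheless committed to an answer in a majority of cases.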

Entities

Institutions

  • arXiv

Sources