ARTFEED — Contemporary Art Intelligence

Frontier LLMs Show High Rates of Premature Closure in Medical Tasks

ai-technology · 2026-05-16

A new study on arXiv (2605.15000) examines premature closure in frontier large language models (LLMs): the tendency to commit to an answer under uncertainty rather than ask for clarification or decline to act. The researchers evaluated five frontier models on MedQA (n=500) and AfriMed-QA (n=490) and, after excluding correct responses, found baseline false-action rates of 55-81% on MedQA and 53-82% on AfriMed-QA. In open-ended evaluations, the models also gave inappropriate answers on 30% of 861 HealthBench questions and on 78% of 191 physician-crafted adversarial queries. Safety-oriented prompting, however, reduced premature closure across models.

Key facts

  • Premature closure defined as inappropriate commitment under uncertainty in LLMs
  • Five frontier LLMs evaluated on MedQA, AfriMed-QA, HealthBench, and adversarial queries
  • Baseline false-action rates of 55-81% on MedQA and 53-82% on AfriMed-QA
  • Inappropriate answers on 30% of HealthBench questions and 78% of adversarial queries
  • Safety-oriented prompting reduced premature closure across models
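As a rough illustration of the headline metric, the sketch below shows one plausible way a "false-action rate" like those reported could be computed: among responses that answered incorrectly, the fraction where the model committed to an answer instead of deferring. The field names and the commit/defer labels are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of a false-action-rate calculation.
# Record fields ("correct", "action") are assumptions for illustration,
# not the paper's actual data schema.

def false_action_rate(responses):
    """Among incorrect responses, return the fraction where the model
    committed to an answer rather than deferring (clarifying/declining)."""
    incorrect = [r for r in responses if not r["correct"]]
    if not incorrect:
        return 0.0
    committed = [r for r in incorrect if r["action"] == "commit"]
    return len(committed) / len(incorrect)

sample = [
    {"correct": False, "action": "commit"},  # wrong answer, committed anyway
    {"correct": False, "action": "defer"},   # wrong, but deferred (safer)
    {"correct": True,  "action": "commit"},  # correct responses are excluded
    {"correct": False, "action": "commit"},
]
print(false_action_rate(sample))  # 2 of the 3 incorrect responses committed
```

Under this reading, the reported 55-81% baseline rates would mean that, on questions the models got wrong, they nonetheless committed to an answer in a majority of cases.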

Entities

Institutions

  • arXiv

Sources