CLEAR Framework Reveals LLM Reliability Issues in Medical Contexts

other · 2026-05-06

A new framework called CLEAR (CLinical Evaluation of Ambiguity and Reliability) exposes how noise and ambiguity degrade large language model (LLM) performance on medical benchmarks. Described in an arXiv preprint (2605.01011), CLEAR systematically perturbs three aspects of each multiple-choice item: the number of answer options, whether the ground-truth answer is present among them, and the semantic framing of the abstention option. The perturbations are applied across three benchmarks and 17 LLMs. Results show that as the number of plausible answers grows, models become less accurate and less able to abstain when they should, especially when the abstention framing shifts from assertive rejection to uncertain phrasing. The study highlights limitations in current evaluation methods, which fail to reflect real-world medical ambiguity.
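The three perturbation axes described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `MCQItem`, `perturb_item`, and `format_prompt` are hypothetical, and so are the two abstention phrasings ("None of the above" for assertive rejection, "I am not sure" for uncertain phrasing) and the choice to place the abstention option last. It only shows how one multiple-choice item might be varied by option count, ground-truth presence, and abstention framing.

```python
import random
import string
from dataclasses import dataclass


@dataclass
class MCQItem:
    """One multiple-choice item (hypothetical container, not from the paper)."""
    question: str
    options: list
    answer: str  # text of the option a reliable model should select


def perturb_item(question, correct, distractors, n_options,
                 include_truth=True,
                 abstain_phrase="None of the above",
                 seed=0):
    """Build one perturbed item along three assumed axes:
    option count, ground-truth presence, and abstention framing."""
    rng = random.Random(seed)
    n_content = n_options - 1  # the last slot is reserved for abstention
    n_distractors = n_content - 1 if include_truth else n_content
    if n_distractors < 0 or n_distractors > len(distractors):
        raise ValueError("not enough distractors for the requested option count")

    content = rng.sample(distractors, n_distractors)
    if include_truth:
        content.append(correct)
    rng.shuffle(content)

    options = content + [abstain_phrase]
    # When the ground truth is removed, abstaining becomes the right choice.
    answer = correct if include_truth else abstain_phrase
    return MCQItem(question, options, answer)


def format_prompt(item):
    """Render the item as a lettered multiple-choice prompt."""
    lines = [item.question]
    for letter, option in zip(string.ascii_uppercase, item.options):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder strings only; no real benchmark content is used here.
    pool = ["distractor 1", "distractor 2", "distractor 3", "distractor 4"]
    item = perturb_item("Placeholder question?", "correct answer", pool,
                        n_options=5, include_truth=False,
                        abstain_phrase="I am not sure")
    print(format_prompt(item))
    print("Expected:", item.answer)
```

Sweeping `n_options`, toggling `include_truth`, and swapping `abstain_phrase` would produce a grid of variants of the same question, so accuracy and abstention rates can be compared across conditions.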

Key facts

CLEAR framework introduced to assess LLM reliability under ambiguity
Evaluated on three benchmarks across 17 LLMs
Increasing the number of plausible answers degrades correct-answer identification
Abstention framing (assertive rejection vs. uncertain phrasing) affects how readily models abstain
Published on arXiv with ID 2605.01011
