Semantic Stability in Clinical LLMs: Evaluating Sensitivity to Prompt Variations

ai-technology · 2026-06-01

A new arXiv preprint (2605.30646) investigates how Large Language Models (LLMs) respond to semantically equivalent but linguistically varied prompts in clinical settings. The researchers propose a semantic verification framework in which Natural Language Inference (NLI) filters candidate prompt variations so that only meaning-preserving ones remain; the filtered set is then refined by an LLM-as-a-judge and audited by a clinical expert. To quantify model sensitivity, they introduce three metrics, including Meaning-Preserving Variation Sensitivity (MVS) and confidence variation. The work highlights a risk specific to healthcare, where a subtle rephrasing of the same question can change a model's prediction, and argues for more robust evaluation methods.
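To make the NLI filtering step concrete, here is a minimal sketch, not the paper's implementation. It assumes that "meaning-preserving" is checked as entailment in both directions and that a 0.9 threshold is used; both are illustrative choices. The names `NliScorer`, `is_meaning_preserving`, `filter_variants`, and `toy_nli` are hypothetical, and `toy_nli` is a word-overlap stand-in that should be replaced by a real NLI model.

```python
from typing import Callable, Dict, List

# Hypothetical scorer type: (premise, hypothesis) -> probability per NLI label.
NliScorer = Callable[[str, str], Dict[str, float]]


def is_meaning_preserving(original: str, variant: str, nli: NliScorer,
                          threshold: float = 0.9) -> bool:
    """Assumed criterion: entailment must hold in both directions."""
    forward = nli(original, variant)["entailment"]
    backward = nli(variant, original)["entailment"]
    return min(forward, backward) >= threshold


def filter_variants(original: str, variants: List[str], nli: NliScorer,
                    threshold: float = 0.9) -> List[str]:
    """Keep only the variants that pass the bidirectional check."""
    return [v for v in variants
            if is_meaning_preserving(original, v, nli, threshold)]


def toy_nli(premise: str, hypothesis: str) -> Dict[str, float]:
    """Stand-in scorer based on word overlap; NOT a real NLI model."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    entail = len(p & h) / len(h) if h else 0.0
    return {"entailment": entail, "neutral": 1.0 - entail, "contradiction": 0.0}


original = "patient reports chest pain"
candidates = ["chest pain reports patient", "patient reports mild headache today"]
kept = filter_variants(original, candidates, toy_nli)
```

In a pipeline like the one described, the variants in `kept` would then go to the LLM-as-a-judge and the clinical expert audit; those stages are not modeled here.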

Key facts

arXiv paper 2605.30646
LLMs used in clinical applications
Semantic verification framework based on NLI
LLM-as-a-judge refinement
Clinical expert audit
Three metrics, including MVS and confidence variation
Focus on safety-critical healthcare settings
Addresses limitations of embedding-based similarity
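The summary above does not give the formal definitions of MVS or confidence variation, so the sketch below uses two simple stand-ins to illustrate the kind of quantity such metrics capture. Both `flip_rate` and `confidence_spread` are hypothetical names and assumed definitions, not the paper's formulas.

```python
from statistics import pstdev
from typing import List


def flip_rate(original_label: str, variant_labels: List[str]) -> float:
    """Fraction of verified variants whose prediction differs from the
    prediction on the original prompt (assumed proxy, not the paper's MVS)."""
    if not variant_labels:
        return 0.0
    flips = sum(1 for label in variant_labels if label != original_label)
    return flips / len(variant_labels)


def confidence_spread(confidences: List[float]) -> float:
    """Population standard deviation of model confidence across variants
    (assumed proxy for confidence variation)."""
    return pstdev(confidences) if len(confidences) > 1 else 0.0


# Illustrative values only; not results from the paper.
rate = flip_rate("sepsis", ["sepsis", "sepsis", "pneumonia", "sepsis"])
spread = confidence_spread([0.8, 0.6])
```

A flip rate of zero together with a small spread would indicate that a model is stable under meaning-preserving rephrasing; higher values indicate the sensitivity the study is concerned with.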
