ARTFEED — Contemporary Art Intelligence

Semantic Stability in Clinical LLMs: Evaluating Sensitivity to Prompt Variations

ai-technology · 2026-06-01

A new arXiv preprint (2605.30646) investigates how Large Language Models (LLMs) respond to semantically equivalent but linguistically varied prompts in clinical settings. The researchers propose a semantic verification framework that uses Natural Language Inference (NLI) to filter for meaning-preserving prompt variations; the filtered set is then refined by an LLM-as-a-judge and audited by a clinical expert. They introduce three metrics, including Meaning-Preserving Variation Sensitivity (MVS) and confidence variation, to quantify model sensitivity. The work highlights a risk in healthcare, where subtle rephrasing can alter a model's predictions, and emphasizes the need for robust evaluation methods.
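To make the pipeline concrete, here is a minimal Python sketch of the two ideas in the summary: filtering prompt variations with an NLI check, then measuring how much predictions move across the surviving variations. Everything in it is an assumption for illustration. The summary does not give the paper's formulas, so flip_rate and confidence_spread are plausible stand-ins rather than the authors' definitions of MVS or confidence variation. The bidirectional entailment rule and the 0.9 threshold are likewise illustrative choices, and toy_entail_prob is a word-overlap placeholder, not a real NLI model.

```python
from statistics import pstdev
from typing import Callable, List, Sequence

# Hypothetical scorer type: returns P(premise entails hypothesis) in [0, 1].
EntailmentFn = Callable[[str, str], float]


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Placeholder scorer based on word overlap. NOT a real NLI model."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / len(h) if h else 0.0


def filter_meaning_preserving(
    original: str,
    candidates: Sequence[str],
    entail_prob: EntailmentFn,
    threshold: float = 0.9,  # illustrative cutoff, not taken from the paper
) -> List[str]:
    """Keep candidates that entail, and are entailed by, the original prompt."""
    kept = []
    for cand in candidates:
        forward = entail_prob(original, cand)
        backward = entail_prob(cand, original)
        if min(forward, backward) >= threshold:
            kept.append(cand)
    return kept


def flip_rate(original_label: str, variant_labels: Sequence[str]) -> float:
    """Assumed sensitivity measure: share of variants whose label differs."""
    if not variant_labels:
        return 0.0
    flips = sum(1 for label in variant_labels if label != original_label)
    return flips / len(variant_labels)


def confidence_spread(confidences: Sequence[float]) -> float:
    """Assumed confidence measure: population std. dev. across variants."""
    return pstdev(confidences) if confidences else 0.0
```

In practice the placeholder scorer would be swapped for a real NLI model, and the kept variations would still go through the LLM-as-a-judge refinement and clinical expert audit that the study describes; neither step is modeled here.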

Key facts

  • arXiv paper 2605.30646
  • LLMs used in clinical applications
  • Semantic verification framework based on NLI
  • LLM-as-a-judge refinement
  • Clinical expert audit
  • Three metrics, including MVS and confidence variation
  • Focus on safety-critical healthcare settings
  • Addresses limitations of embedding-based similarity

Entities

Institutions

  • arXiv

Sources