ARTFEED — Contemporary Art Intelligence

Causal Sensitivity Score Reveals Hidden Capability Profiles in Clinical LLMs

ai-technology · 2026-06-01

A recent study introduces the Causal Sensitivity Score (CSS), a pre-registered interventional metric that evaluates clinical AI systems by perturbing oncology tumor-board cases along five dimensions: biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations. Responses are scored on a {0, 0.5, 1.0} scale according to whether the model updates its recommendation in the correct direction. The study compared CSS with the Consensus Match Score (CMS), a coverage-based weighted-recall metric, by evaluating six frontier models from three labs in single-shot inference on 224 cases. Models with nearly identical coverage-based scores behaved very differently when patient inputs changed: all six changed rank between CMS and CSS, and the model with the lowest CMS ranked highest on CSS. The study argues that coverage-based metrics can mask critical failures in clinical AI and advocates interventional metrics such as CSS for a more accurate assessment of performance.
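As a rough illustration of how per-perturbation scores on the {0, 0.5, 1.0} scale might be aggregated, here is a minimal Python sketch. The summary does not say how the study defines the 0.5 level or how it aggregates scores, so the plain mean used here, the dimension identifiers, and the names `PerturbationResult`, `css_by_dimension`, and `overall_css` are all assumptions made for illustration, not the study's implementation.

```python
from dataclasses import dataclass
from statistics import mean

# The five perturbation dimensions named in the article.
# The identifier spellings are hypothetical.
DIMENSIONS = (
    "biomarker_flip",
    "prior_treatment_failure",
    "biomarker_removal",
    "surgery_status_change",
    "stage_perturbation",
)

# The only score values the article mentions.
ALLOWED_SCORES = (0.0, 0.5, 1.0)


@dataclass(frozen=True)
class PerturbationResult:
    """One scored response to one perturbed case (hypothetical record type)."""
    case_id: str
    dimension: str
    score: float

    def __post_init__(self):
        # Reject anything outside the dimensions and scale described above.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")
        if self.score not in ALLOWED_SCORES:
            raise ValueError(f"score must be one of {ALLOWED_SCORES}")


def css_by_dimension(results):
    """Mean score per dimension (assumed aggregation; empty dimensions omitted)."""
    buckets = {d: [] for d in DIMENSIONS}
    for r in results:
        buckets[r.dimension].append(r.score)
    return {d: mean(scores) for d, scores in buckets.items() if scores}


def overall_css(results):
    """Unweighted mean over all scored perturbations (assumed aggregation)."""
    return mean(r.score for r in results)
```

Breaking the score down by dimension would show where a model fails to react, for example to a biomarker flip versus a stage change, which a single coverage number cannot show.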

Key facts

  • Causal Sensitivity Score (CSS) is a pre-registered interventional metric for clinical AI evaluation.
  • CSS mutates oncology tumor-board cases along five dimensions: biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations.
  • Scoring uses a {0, 0.5, 1.0} scale that reflects whether a model updates its recommendation in the correct direction.
  • Six frontier models from three labs were evaluated in single-shot inference across 224 cases.
  • Models with nearly identical coverage-based scores showed radically different behavior under input changes.
  • All six models changed rank between CSS and Consensus Match Score (CMS).
  • The CMS-worst model became the CSS-best model.
  • Coverage-based metrics can mask critical failures in clinical AI.
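
The rank reversal between CMS and CSS can be checked mechanically once both scores are available per model. The sketch below is a minimal Python example under stated assumptions: the function names `rank_models` and `rank_shifts` are hypothetical, and the model names and score values are invented placeholders, not results from the study.

```python
def rank_models(scores):
    """Map each model to its rank, where rank 1 is the highest score.

    Ties are broken by insertion order, because sorted() is stable.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_shifts(cms_scores, css_scores):
    """Return CMS rank minus CSS rank for each model.

    A positive value means the model ranks better on CSS than on CMS.
    """
    cms_rank = rank_models(cms_scores)
    css_rank = rank_models(css_scores)
    return {model: cms_rank[model] - css_rank[model] for model in cms_scores}


# Invented placeholder values, chosen only to illustrate a reversal.
cms = {"model_a": 0.80, "model_b": 0.79, "model_c": 0.70}
css = {"model_a": 0.40, "model_b": 0.55, "model_c": 0.75}

shifts = rank_shifts(cms, css)
# In this toy data, model_c is last on CMS and first on CSS.
print(shifts)
```

In this toy data the model that is last on CMS is first on CSS, the same kind of reversal the study reports for its six models.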
