Causal Sensitivity Score Reveals Hidden Capability Profiles in Clinical LLMs
A recent study introduces the Causal Sensitivity Score (CSS), a pre-registered interventional metric for evaluating clinical AI systems. CSS mutates oncology tumor-board cases along five dimensions (biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations) and scores each response on a {0, 0.5, 1.0} scale according to whether the model updates its recommendation in the correct direction. The study compares CSS with the Consensus Match Score (CMS), a coverage-focused weighted-recall metric, across six frontier models from three labs, evaluated in single-shot inference on 224 cases. Models with nearly identical coverage-based scores behaved very differently when patient inputs changed, and the model with the lowest CMS ranked highest on CSS. The study argues that coverage-based metrics have clear limits in clinical AI and advocates interventional metrics such as CSS for assessing performance more accurately.
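To make the scoring idea concrete, here is a minimal sketch of how per-mutation judgments on the {0, 0.5, 1.0} scale could be rolled up into an overall score and a per-dimension breakdown. The summary does not say how the study aggregates scores, so the simple mean used here is an assumption, as are the function name `css_summary` and the dimension labels.

```python
from collections import defaultdict
from statistics import mean

# The three score values named in the summary.
ALLOWED_SCORES = {0.0, 0.5, 1.0}

# Hypothetical labels for the five mutation dimensions (names are assumptions).
DIMENSIONS = (
    "biomarker_flip",
    "prior_treatment_failure",
    "biomarker_removal",
    "surgery_status_change",
    "stage_perturbation",
)


def css_summary(judgments):
    """Aggregate (dimension, score) pairs into an overall and per-dimension mean.

    Averaging is an assumption; the study may weight or aggregate differently.
    """
    by_dim = defaultdict(list)
    for dim, score in judgments:
        if dim not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim!r}")
        if score not in ALLOWED_SCORES:
            raise ValueError(f"score must be 0, 0.5, or 1.0, got {score!r}")
        by_dim[dim].append(score)

    per_dim = {dim: mean(scores) for dim, scores in by_dim.items()}
    all_scores = [s for scores in by_dim.values() for s in scores]
    overall = mean(all_scores) if all_scores else None
    return overall, per_dim


# Usage with made-up judgments (not data from the study).
overall, per_dim = css_summary([
    ("biomarker_flip", 1.0),
    ("biomarker_flip", 0.0),
    ("stage_perturbation", 0.5),
])
```

Keeping the per-dimension breakdown alongside the overall mean matters here, because a model could look adequate on average while failing one kind of mutation entirely.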
Key facts
- Causal Sensitivity Score (CSS) is a pre-registered interventional metric for clinical AI evaluation.
- CSS mutates oncology tumor-board cases along five dimensions: biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations.
- Each response is scored on a {0, 0.5, 1.0} scale according to whether the recommendation updates in the correct direction.
- Six frontier models from three labs were evaluated in single-shot inference across 224 cases.
- Models with nearly identical coverage-based scores showed radically different behavior under input changes.
- All six models changed rank between CSS and Consensus Match Score (CMS).
- The CMS-worst model became the CSS-best model.
- Coverage-based metrics can mask critical failures in clinical AI.
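The rank reversal described in the key facts can be checked mechanically once both scores are available per model. The sketch below ranks models under each metric and reports each model's position in both. The helper names `ranks` and `rank_shifts` are hypothetical, and the three models and their scores are invented placeholders for illustration, not results from the study.

```python
def ranks(scores):
    """Map each model to its rank (1 = best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_shifts(cms, css):
    """Return {model: (cms_rank, css_rank)} for models scored under both metrics."""
    cms_rank, css_rank = ranks(cms), ranks(css)
    return {model: (cms_rank[model], css_rank[model]) for model in cms}


# Invented placeholder scores: near-identical coverage, divergent sensitivity.
cms_scores = {"model_a": 0.62, "model_b": 0.61, "model_c": 0.60}
css_scores = {"model_a": 0.40, "model_b": 0.55, "model_c": 0.70}

shifts = rank_shifts(cms_scores, css_scores)
# Models whose rank differs between the two metrics.
changed = [model for model, (r_cms, r_css) in shifts.items() if r_cms != r_css]
```

In this toy input the coverage scores differ by at most 0.02, yet the model ranked last on CMS ranks first on CSS, which is the pattern the study reports at full scale.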