LLM-as-a-Judge in Healthcare: Scoping Review Reveals Validation Gaps
A recent scoping review of LLM-as-a-Judge (LaaJ) in the healthcare sector was published on arXiv. After screening 11,727 records from six databases covering January 2020 to January 2026, the authors included 49 studies. Evaluation and benchmarking were the dominant use case, appearing in 37 studies (75.5%), with pointwise scoring used in 42 studies (85.7%) and GPT-family judges in 36 studies (73.5%). Validation rigor was found to be inadequate: among the 36 studies that involved human validators, the median number of expert validators was just 3, and the remaining 13 studies (26.5%) used none. Notably, 36 studies (73.5%) lacked bias risk assessments, only one (2.0%) addressed demographic fairness, and none evaluated temporal stability or patient context. Deployment was also limited: no studies reported clinical application. The review proposes the MedJUDGE framework to raise LaaJ evaluation standards in healthcare.
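The reported percentages are consistent with the raw counts over the 49 included studies; a quick sanity check (counts taken from the review as summarized above):

```python
# Verify the proportions reported in the scoping review (n = 49 included studies).
counts = {
    "evaluation/benchmarking": 37,   # 75.5%
    "pointwise scoring": 42,         # 85.7%
    "GPT-family judges": 36,         # 73.5%
    "no human validators": 13,       # 26.5%
    "demographic fairness": 1,       # 2.0%
}
n = 49
for label, k in counts.items():
    print(f"{label}: {k}/{n} = {100 * k / n:.1f}%")
```

Each count divided by 49 reproduces the percentage quoted in the review to one decimal place.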
Key facts
- Scoping review of LLM-as-a-Judge in healthcare published on arXiv.
- Screened 11,727 records; included 49 studies from six databases (Jan 2020–Jan 2026).
- 75.5% of studies focused on evaluation and benchmarking.
- 85.7% used pointwise scoring; 73.5% used GPT-family judges.
- Median of 3 expert validators among the 36 studies with human involvement.
- 26.5% of studies (13 of 49) used no human validators.
- 73.5% of studies lacked risk of bias testing.
- Only 2.0% examined demographic fairness; none assessed temporal stability or patient context.
- No studies reported real-world clinical deployment.
- MedJUDGE framework proposed to standardize evaluation.
Entities
Institutions
- arXiv