LLMs Fail Autonomous General Practice in New Benchmark
A study published on arXiv introduces GPBench, a benchmark designed to evaluate large language models (LLMs) on real-world general practice competencies. Unlike existing exam-style tests, GPBench uses data that domain experts annotated in line with routine clinical standards. Ten state-of-the-art LLMs were assessed, and the findings show that they are not suitable for autonomous deployment in clinical general practice. The research highlights a gap between LLM performance on simplified tasks and the complex responsibilities of general practitioners.
Key facts
- GPBench is a new benchmark for evaluating LLMs in general practice.
- Data is annotated by domain experts according to routine clinical standards.
- Ten state-of-the-art LLMs were evaluated.
- Current LLMs are not suitable for autonomous clinical deployment.
- Existing benchmarks lack a competency-based structure aligned with real-world duties.
- The study is available on arXiv under ID 2503.17599.
- The framework assesses whether LLMs can function as general practitioners (GPs).
- Findings indicate significant limitations in LLM clinical competencies.
Entities
Institutions
- arXiv