LLMs Fail Autonomous General Practice in New Benchmark

ai-technology · 2026-05-23

A study published on arXiv introduces GPBench, a benchmark designed to evaluate large language models (LLMs) on real-world general practice competencies. Unlike existing exam-style tests, GPBench uses data annotated by domain experts according to routine clinical standards. Ten state-of-the-art LLMs were assessed, and the findings indicate that they are not suitable for autonomous deployment in clinical general practice. The research highlights a gap between LLM performance on simplified tasks and the complex responsibilities of general practitioners (GPs).

Key facts

GPBench is a new benchmark for evaluating LLMs in general practice.
Data is annotated by domain experts according to routine clinical standards.
Ten state-of-the-art LLMs were evaluated.
Current LLMs are not suitable for autonomous clinical deployment.
Existing benchmarks lack competency-based structure aligned with real-world duties.
The study is published on arXiv with ID 2503.17599.
The framework assesses LLMs' capability to function as GPs.
Findings indicate significant limitations in LLM clinical competencies.
