ARTFEED — Contemporary Art Intelligence

LLMs Fail Autonomous General Practice in New Benchmark

ai-technology · 2026-05-23

A study published on arXiv introduces GPBench, a benchmark designed to evaluate large language models (LLMs) on real-world general practice competencies. Unlike existing exam-style tests, GPBench uses data that domain experts annotated according to routine clinical standards. The authors assessed ten state-of-the-art LLMs and found that they are not suitable for autonomous deployment in clinical general practice. The research highlights a gap between LLM performance on simplified tasks and the complex responsibilities of general practitioners.

Key facts

  • GPBench is a new benchmark for evaluating LLMs in general practice.
  • Data is annotated by domain experts according to routine clinical standards.
  • Ten state-of-the-art LLMs were evaluated.
  • Current LLMs are not suitable for autonomous clinical deployment.
  • Existing benchmarks lack competency-based structure aligned with real-world duties.
  • The study is published on arXiv with ID 2503.17599.
  • The framework assesses whether LLMs can function as general practitioners (GPs).
  • Findings indicate significant limitations in LLM clinical competencies.

Entities

Institutions

  • arXiv
