ARTFEED — Contemporary Art Intelligence

PhysicianBench: New Benchmark for LLM Agents in EHR Environments

ai-technology · 2026-05-06

Researchers have introduced PhysicianBench, a benchmark that assesses large language model (LLM) agents on the tasks physicians perform in electronic health record (EHR) settings. Unlike existing benchmarks, which emphasize static knowledge or isolated actions, PhysicianBench targets long-horizon composite workflows that mirror actual clinical practice. It comprises 100 tasks derived from real consultation cases involving both primary care and subspecialty physicians, each independently reviewed by a panel of physicians. The tasks run in an EHR environment built on real patient records and accessed through standard APIs from commercial EHR providers. They span 21 specialties, including cardiology, endocrinology, oncology, and psychiatry, and cover workflow types such as diagnosis interpretation.

Key facts

  • PhysicianBench evaluates LLM agents on physician tasks in EHR environments.
  • It comprises 100 long-horizon tasks from real consultation cases.
  • Tasks are independently reviewed by a panel of physicians.
  • Tasks use real patient records and standard EHR APIs.
  • Tasks span 21 specialties including cardiology, endocrinology, oncology, psychiatry.
  • Existing benchmarks fail to capture long-horizon composite workflows.
  • PhysicianBench addresses the gap in evaluating LLM agents on real clinical workflows.
  • The benchmark is introduced in arXiv paper 2605.02240.
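The paper's task schema is not reproduced here, but a long-horizon task of the kind described above can be pictured roughly as follows. The field names, the ordered-step structure, and the step-completion score are illustrative assumptions for this sketch, not PhysicianBench's actual format or metric.

```python
from dataclasses import dataclass

@dataclass
class PhysicianTask:
    """Hypothetical sketch of a long-horizon EHR task record."""
    task_id: str
    specialty: str    # one of the 21 specialties, e.g. "cardiology"
    workflow: str     # workflow type, e.g. "diagnosis interpretation"
    steps: list[str]  # ordered sub-actions the agent must complete

def score(completed: list[str], task: PhysicianTask) -> float:
    """Fraction of the task's ordered steps completed in sequence
    (an assumed metric, not the one used in the paper)."""
    done = 0
    for step in task.steps:
        if done < len(completed) and completed[done] == step:
            done += 1
        else:
            break
    return done / len(task.steps)

task = PhysicianTask(
    task_id="cardio-001",
    specialty="cardiology",
    workflow="diagnosis interpretation",
    steps=["retrieve_labs", "review_ecg", "draft_assessment"],
)
print(score(["retrieve_labs", "review_ecg", "draft_assessment"], task))  # 1.0
print(score(["retrieve_labs"], task))
```

The point of the sketch is the contrast the article draws: scoring a multi-step workflow end to end, rather than grading a single isolated action.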

Entities

Institutions

  • arXiv
