RxEval: Benchmarking LLM Medication Recommendation
RxEval is a new benchmark for assessing large language models (LLMs) on inpatient medication recommendation. Unlike previous benchmarks, which operate on broad drug codes at the admission level, RxEval targets individual prescriptions through multiple-choice questions. Each question presents a comprehensive patient profile and a chronological clinical history, and asks the model to select the correct medication-dose-route combination from actual prescriptions, mixed with patient-specific distractors generated via reasoning-chain perturbation. The benchmark comprises 1,547 questions across 584 patients, spanning 18 diagnostic categories and 969 distinct medications. An evaluation of 16 LLMs shows that RxEval is both demanding and discriminating, with F1 scores ranging from 0.2 to 0.6, underscoring the gap between current LLM performance and the requirements of clinical decision-making.
Key facts
- RxEval is a prescription-level benchmark for LLM medication recommendation.
- It uses multiple-choice questions with patient profiles and clinical trajectories.
- The benchmark includes 1,547 questions, 584 patients, 18 diagnostic categories, and 969 medications.
- 16 LLMs were evaluated, with F1 scores ranging from 0.2 to 0.6.
- Existing benchmarks fail to capture per-timepoint prescribing decisions.
- Distractors are generated via reasoning-chain perturbation.
- The task requires selecting medication-dose-route triples.
- RxEval reveals a gap between LLM performance and clinical needs.
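The reported F1 scores can be made concrete with a small sketch. The exact scoring protocol of RxEval is not specified here; the code below assumes a common convention, comparing predicted against ground-truth (medication, dose, route) triples per question. The function name `rx_f1` and the example triples are illustrative, not taken from the benchmark.

```python
def rx_f1(predicted, actual):
    """Set-based F1 over (medication, dose, route) triples.

    Assumed scoring convention, not the confirmed RxEval protocol:
    a prediction counts as correct only if all three fields match.
    """
    pred, gold = set(predicted), set(actual)
    tp = len(pred & gold)                          # exact triple matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: one of two predicted triples matches.
pred = [("metformin", "500 mg", "PO"), ("lisinopril", "10 mg", "PO")]
gold = [("metformin", "500 mg", "PO"), ("heparin", "5000 units", "SC")]
print(round(rx_f1(pred, gold), 2))  # → 0.5
```

Requiring an exact match on all three fields reflects the prescription-level framing: getting the drug right but the dose or route wrong still counts as an error.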