ABRA: New Benchmark Tests AI Agents on Radiology Workflows
ABRA is a new benchmark designed to evaluate AI agents on radiology workflows, requiring them to operate the OHIF viewer backed by an Orthanc DICOM server. Agents are given 21 function-calling tools for actions such as slice navigation, windowing, and annotation, and are evaluated on 655 programmatically generated tasks spanning three difficulty tiers and eight task types. Tasks are derived from the LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT datasets. Each run is scored along three axes, Planning, Execution, and Outcome, following Bluethgen et al. (2025). Ten models, five closed-weight and five open-weight, achieved at least 89% Execution. The study is published on arXiv as 2605.11224v1.
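The three-axis scoring can be illustrated with a minimal sketch. The field names and aggregation below are hypothetical, intended only to show how per-task pass/fail judgments on each axis roll up into the percentages the study reports; the actual scoring protocol follows Bluethgen et al. (2025).

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    # One benchmark task, judged along the three axes (hypothetical fields).
    planning: bool   # did the agent propose a sensible tool-call plan?
    execution: bool  # did every issued tool call run without error?
    outcome: bool    # did the final viewer state match the target?

def axis_scores(results: list[TaskResult]) -> dict[str, float]:
    """Fraction of tasks passing each axis, as percentages."""
    n = len(results)
    return {
        "planning": 100 * sum(r.planning for r in results) / n,
        "execution": 100 * sum(r.execution for r in results) / n,
        "outcome": 100 * sum(r.outcome for r in results) / n,
    }

# Example: all three tasks are planned well, two execute cleanly,
# and only one reaches the correct final state.
demo = [
    TaskResult(planning=True, execution=True, outcome=True),
    TaskResult(planning=True, execution=True, outcome=False),
    TaskResult(planning=True, execution=False, outcome=False),
]
print(axis_scores(demo))
```

A result like "89% Execution" would then mean 89% of tasks had every tool call run without error, regardless of whether the final outcome was correct.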
Key facts
- ABRA is a radiology-agent benchmark requiring navigation of the OHIF viewer and an Orthanc DICOM server
- 21 function-calling tools are used for tasks like slice navigation, windowing, and annotation
- 655 programmatically generated tasks across three difficulty tiers and eight types
- Tasks sourced from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT
- Scoring based on Planning, Execution, and Outcome (Bluethgen et al., 2025)
- Ten models (five closed-weight, five open-weight) achieve at least 89% Execution
- Published on arXiv as 2605.11224v1
- Benchmark addresses limitations of existing medical-agent benchmarks that use pre-selected imaging
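For context, function-calling tools of the kind listed above are typically declared as JSON schemas the agent can invoke. The sketch below shows two hypothetical declarations modeled on tool categories the benchmark names (slice navigation and windowing); the actual tool names and parameters in ABRA may differ.

```python
# Hypothetical tool declarations in the common JSON-schema function-calling
# format; names and parameters are illustrative, not ABRA's actual API.
TOOLS = [
    {
        "name": "navigate_to_slice",
        "description": "Scroll the active viewport to a given slice index.",
        "parameters": {
            "type": "object",
            "properties": {
                "slice_index": {"type": "integer", "minimum": 0},
            },
            "required": ["slice_index"],
        },
    },
    {
        "name": "set_window",
        "description": "Set window width/level for the active viewport.",
        "parameters": {
            "type": "object",
            "properties": {
                "window_width": {"type": "number"},
                "window_level": {"type": "number"},
            },
            "required": ["window_width", "window_level"],
        },
    },
]

# The agent sees these schemas and emits calls like:
#   {"name": "set_window", "arguments": {"window_width": 1500, "window_level": -600}}
print([t["name"] for t in TOOLS])
```

With 21 such tools, Execution scoring reduces to checking that every call the agent emits validates against its declared schema and runs without error in the viewer.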
Entities
Platforms and tools
- arXiv
- OHIF
- Orthanc
Datasets
- LIDC-IDRI
- Duke Breast Cancer MRI
- NLST New-Lesion LongCT