ABRA: New Benchmark Tests AI Agents on Radiology Workflows
ABRA is a new benchmark designed to evaluate AI agents on radiology workflows, requiring them to operate the OHIF viewer backed by an Orthanc DICOM server. Agents are given 21 function-calling tools for actions such as slice navigation, windowing, and annotation, and are evaluated on 655 programmatically generated tasks spanning three difficulty tiers and eight task types. Tasks are derived from the LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT datasets. Each run is scored along three axes, Planning, Execution, and Outcome, following Bluethgen et al. (2025). Ten models, five closed-weight and five open-weight, achieved at least 89% Execution. The study is published on arXiv as 2605.11224v1.
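The three-axis scoring can be illustrated with a minimal sketch. The field names and aggregation below are hypothetical, intended only to show how per-task pass/fail judgments on each axis roll up into the percentages the study reports; the actual scoring protocol follows Bluethgen et al. (2025).

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    # One benchmark task, judged along the three axes (hypothetical fields).
    planning: bool   # did the agent propose a sensible tool-call plan?
    execution: bool  # did every issued tool call run without error?
    outcome: bool    # did the final viewer state match the target?

def axis_scores(results: list[TaskResult]) -> dict[str, float]:
    """Fraction of tasks passing each axis, as percentages."""
    n = len(results)
    return {
        "planning": 100 * sum(r.planning for r in results) / n,
        "execution": 100 * sum(r.execution for r in results) / n,
        "outcome": 100 * sum(r.outcome for r in results) / n,
    }

# Example: all three tasks are planned well, two execute cleanly,
# and only one reaches the correct final state.
demo = [
    TaskResult(planning=True, execution=True, outcome=True),
    TaskResult(planning=True, execution=True, outcome=False),
    TaskResult(planning=True, execution=False, outcome=False),
]
print(axis_scores(demo))
```

A result like "89% Execution" would then mean 89% of tasks had every tool call run without error, regardless of whether the final outcome was correct.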
Key facts
- ABRA is a radiology-agent benchmark requiring navigation of the OHIF viewer and an Orthanc DICOM server
- 21 function-calling tools are used for tasks like slice navigation, windowing, and annotation
- 655 programmatically generated tasks across three difficulty tiers and eight types
- Tasks sourced from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT
- Scoring based on Planning, Execution, and Outcome (Bluethgen et al., 2025)
- Ten models (five closed-weight, five open-weight) achieve at least 89% Execution
- Published on arXiv as 2605.11224v1
- Benchmark addresses limitations of existing medical-agent benchmarks that use pre-selected imaging
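For context, function-calling tools of the kind listed above are typically declared as JSON schemas the agent can invoke. The sketch below shows two hypothetical declarations modeled on tool categories the benchmark names (slice navigation and windowing); the actual tool names and parameters in ABRA may differ.

```python
# Hypothetical tool declarations in the common JSON-schema function-calling
# format; names and parameters are illustrative, not ABRA's actual API.
TOOLS = [
    {
        "name": "navigate_to_slice",
        "description": "Scroll the active viewport to a given slice index.",
        "parameters": {
            "type": "object",
            "properties": {
                "slice_index": {"type": "integer", "minimum": 0},
            },
            "required": ["slice_index"],
        },
    },
    {
        "name": "set_window",
        "description": "Set window width/level for the active viewport.",
        "parameters": {
            "type": "object",
            "properties": {
                "window_width": {"type": "number"},
                "window_level": {"type": "number"},
            },
            "required": ["window_width", "window_level"],
        },
    },
]

# The agent sees these schemas and emits calls like:
#   {"name": "set_window", "arguments": {"window_width": 1500, "window_level": -600}}
print([t["name"] for t in TOOLS])
```

With 21 such tools, Execution scoring reduces to checking that every call the agent emits validates against its declared schema and runs without error in the viewer.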
Entities
Platforms and tools
- arXiv
- OHIF
- Orthanc
Datasets
- LIDC-IDRI
- Duke Breast Cancer MRI
- NLST New-Lesion LongCT