ARTFEED — Contemporary Art Intelligence

ABRA: New Benchmark Tests AI Agents on Radiology Workflows

other · 2026-05-13

ABRA is a new benchmark designed specifically to evaluate AI agents on radiology workflows, requiring them to operate the OHIF viewer and the Orthanc DICOM server. The benchmark provides 21 function-calling tools for operations such as slice navigation, windowing, and annotation, and comprises 655 programmatically generated tasks spanning three difficulty tiers and eight task types. Tasks are derived from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT. Scoring follows the Planning, Execution, and Outcome rubric of Bluethgen et al. (2025). Ten models achieved at least 89% Execution, and the study is published on arXiv as 2605.11224v1.
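Function-calling tools of this kind are typically exposed to an agent as JSON-style schemas. A minimal sketch of what one of the 21 tools might look like for slice navigation — the tool name and parameters here are illustrative assumptions, not taken from ABRA:

```python
# Hypothetical tool definition in a common function-calling schema format.
# The name, parameters, and descriptions are illustrative; ABRA's actual
# tool schemas may differ.
navigate_slice_tool = {
    "name": "navigate_to_slice",
    "description": "Move the OHIF viewport to a specific slice of the active series.",
    "parameters": {
        "type": "object",
        "properties": {
            "series_uid": {
                "type": "string",
                "description": "DICOM SeriesInstanceUID of the target series",
            },
            "slice_index": {
                "type": "integer",
                "description": "Zero-based index of the slice to display",
            },
        },
        "required": ["series_uid", "slice_index"],
    },
}
```

An agent would receive a list of such schemas and emit structured calls against them; the benchmark harness would then translate each call into viewer or DICOM-server actions.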

Key facts

  • ABRA is a radiology-agent benchmark requiring navigation of OHIF viewer and Orthanc DICOM server
  • 21 function-calling tools are used for tasks like slice navigation, windowing, and annotation
  • 655 programmatically generated tasks across three difficulty tiers and eight types
  • Tasks sourced from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT
  • Scoring based on Planning, Execution, and Outcome (Bluethgen et al., 2025)
  • Ten models (five closed-weight, five open-weight) achieve at least 89% Execution
  • Published on arXiv as 2605.11224v1
  • Benchmark addresses limitations of existing medical-agent benchmarks that use pre-selected imaging
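The three-axis rubric above can be read as independent per-task scores aggregated per model. A minimal sketch, assuming each task yields a score in [0, 1] on each axis and that scores are averaged — the aggregation shown is illustrative, not the paper's exact method:

```python
from statistics import mean

def aggregate_scores(task_results):
    """Average Planning, Execution, and Outcome scores across tasks.

    task_results: list of dicts with keys 'planning', 'execution', and
    'outcome', each a float in [0, 1]. This simple mean is an illustrative
    assumption, not ABRA's published aggregation.
    """
    return {
        axis: mean(r[axis] for r in task_results)
        for axis in ("planning", "execution", "outcome")
    }

results = [
    {"planning": 0.9, "execution": 1.0, "outcome": 0.8},
    {"planning": 0.7, "execution": 0.9, "outcome": 0.6},
]
scores = aggregate_scores(results)  # e.g. scores["execution"] == 0.95
```

Under this reading, the reported "at least 89% Execution" would correspond to a mean Execution score of 0.89 or higher across the 655 tasks.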

Entities

Institutions

  • arXiv

Software

  • OHIF
  • Orthanc

Datasets

  • LIDC-IDRI
  • Duke Breast Cancer MRI
  • NLST New-Lesion LongCT

Sources