ARTFEED — Contemporary Art Intelligence

CHI-Bench: Benchmarking AI Agents for Healthcare Workflows

ai-technology · 2026-05-20

χ-Bench is a new benchmark that measures how well AI agents can automate complex, long-horizon healthcare workflows end to end. It stresses three capabilities: policy density, meaning decisions that must follow a wide range of medical, insurance, and operational rules; multi-role composition, where an agent takes on several roles and switches between them; and multilateral interaction, meaning back-and-forth exchanges such as peer reviews and patient outreach. The tasks span three workflow domains: provider prior authorization, payer utilization management, and care management. The benchmark runs on a simulator with 20 healthcare applications and 87 MCP tools, guided by a managed-care operations handbook of more than 1,290 documents, and was evaluated across 30 agent harnesses and models.

Key facts

  • χ-Bench is a benchmark for long-horizon healthcare workflows.
  • It tests policy density, multi-role composition, and multilateral interaction.
  • Covers provider prior authorization, payer utilization management, and care management.
  • Simulator includes 20 healthcare apps and 87 MCP tools.
  • Guided by a 1,290+ document managed-care operations handbook.
  • Tested across 30 agent harnesses and models.
  • Published on arXiv with ID 2605.16679.
  • arXiv announce type is "cross" (a cross-listing).

Entities

Institutions

  • arXiv
