CHI-Bench: Benchmarking AI Agents for Healthcare Workflows

ai-technology · 2026-05-20

χ-Bench is a new benchmark that measures how well AI agents can automate complex healthcare workflows end to end. It stresses three properties of these workflows: policy density, meaning decisions must follow a wide range of medical, insurance, and operational rules; multi-role composition, where an agent takes on different roles and switches between them; and multilateral interaction, which involves back-and-forth exchanges such as peer reviews and patient outreach. The benchmark covers three domains: provider prior authorization, payer utilization management, and care management. It runs in a simulator with 20 healthcare applications and 87 MCP tools, is guided by a managed-care operations handbook of more than 1,290 documents, and was tested across 30 agent harnesses and models.

Key facts

χ-Bench is a benchmark for long-horizon healthcare workflows.
It tests policy density, multi-role composition, and multilateral interaction.
It covers provider prior authorization, payer utilization management, and care management.
The simulator includes 20 healthcare apps and 87 MCP tools.
It is guided by a managed-care operations handbook of 1,290+ documents.
It was tested across 30 agent harnesses and models.
Published on arXiv with ID 2605.16679.
Announced on arXiv as a cross-listing (announce type: cross).
