CodeClinic Benchmark Tests LLM Clinical Reasoning Skills
CodeClinic is a new benchmark that tests whether large language model agents can synthesize and reuse clinical skills, moving beyond the limitations of fixed tool systems. Built on the MIMIC-IV dataset, it evaluates agents on workflows such as monitoring ICU patients and tracking their status through electronic health records. Existing approaches depend on manually curated tools for tasks like sepsis detection and organ failure assessment, which demand substantial expert maintenance, while zero-shot code generation often yields inefficient and unreliable reasoning chains. CodeClinic comprises two tasks: longitudinal ICU surveillance, which tracks patient trajectories over time, and compositional information seeking, which measures how well synthesized skills combine to answer complex questions. The goal is to improve adaptability to institution-specific clinical policies while reducing reliance on rigid tool libraries.
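To make the "reusable skill" idea concrete, here is a minimal, hypothetical sketch, not code from the benchmark itself: a small qSOFA-screening function stands in for a skill an agent might synthesize once, and a second skill is composed from it. The record schema, function names, and the choice of a qSOFA screen are all illustrative assumptions rather than details drawn from CodeClinic.

```python
from dataclasses import dataclass

# Hypothetical minimal record schema; MIMIC-IV's real tables
# (e.g., chartevents, labevents) are far richer than this.
@dataclass
class VitalsSnapshot:
    respiratory_rate: float  # breaths/min
    systolic_bp: float       # mmHg
    gcs: int                 # Glasgow Coma Scale, 3-15

# A "skill": a small, reusable, testable function an agent could
# synthesize once and call repeatedly. This one implements the
# standard qSOFA screen; a score of 2 or more suggests sepsis risk.
def qsofa_score(v: VitalsSnapshot) -> int:
    return sum([
        v.respiratory_rate >= 22,
        v.systolic_bp <= 100,
        v.gcs < 15,
    ])

# Composition: a higher-level skill built from the one above,
# mirroring the benchmark's compositional information seeking idea.
def flag_for_sepsis_review(history: list[VitalsSnapshot]) -> bool:
    return any(qsofa_score(v) >= 2 for v in history)

if __name__ == "__main__":
    stay = [
        VitalsSnapshot(respiratory_rate=18, systolic_bp=120, gcs=15),
        VitalsSnapshot(respiratory_rate=24, systolic_bp=95, gcs=14),
    ]
    print(flag_for_sepsis_review(stay))  # True: second snapshot meets two criteria
```

The point of this framing is that such functions, once written, can be verified, cached, and recombined across queries instead of being regenerated from scratch each time, which is the behavior the benchmark is designed to measure.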
Key facts
- CodeClinic is a benchmark for evaluating LLM agents in clinical reasoning.
- It is built on the MIMIC-IV dataset.
- The benchmark has two tasks: longitudinal ICU surveillance and compositional information seeking.
- Existing systems rely on manually curated clinical tools for sepsis detection and organ failure assessment.
- Maintaining tool libraries requires substantial expert effort.
- Zero-shot querying or code generation often produces inefficient and unreliable reasoning chains.
- The benchmark tests whether agents can synthesize and compose reusable clinical skills.
- It aims to improve adaptability to institution-specific clinical policies.
Entities
- CodeClinic (benchmark)
- MIMIC-IV (electronic health record dataset)