RealICU Benchmark Tests LLM Reasoning on Long ICU Data
Researchers have developed RealICU, a benchmark that assesses large language models (LLMs) on long ICU records, with labels assigned by senior physicians in hindsight, after reviewing each patient's complete trajectory. This design addresses a key shortcoming of existing ICU benchmarks, which treat historical clinician actions as ground truth even though those actions can be suboptimal when taken with incomplete information. RealICU defines four tasks driven by physician needs: assessing Patient Status, identifying Acute Problems, suggesting Recommended Actions, and recognizing Red Flag actions that could lead to unsafe outcomes. Each patient trajectory is partitioned into segments for evaluation. The paper is available on arXiv (2605.13542).
Key facts
- RealICU is a hindsight-annotated benchmark for LLMs under realistic ICU conditions.
- Labels are created after senior physicians review the full patient trajectory.
- Four tasks: Patient Status, Acute Problems, Recommended Actions, Red Flag actions.
- Existing ICU benchmarks treat historical clinician actions as ground truth.
- Clinician actions may be suboptimal due to incomplete information.
- The benchmark partitions each patient trajectory into segments for evaluation.
- Published on arXiv with ID 2605.13542.
- Aims to assess true reasoning capabilities of AI systems.
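The trajectory partitioning described above could be sketched as follows. This is a minimal illustration only: the windowing scheme, field names, and task keys are assumptions for the sake of example, since the exact segmentation procedure used by RealICU is not described here.

```python
from dataclasses import dataclass

# Task names mirror the four tasks listed above (identifiers are assumed).
TASKS = ["patient_status", "acute_problems", "recommended_actions", "red_flags"]

@dataclass
class Event:
    t: float       # hours since ICU admission
    kind: str      # e.g. "vitals", "lab", "med"
    payload: dict

def segment_trajectory(events, window_hours=6.0):
    """Partition a time-ordered event list into fixed-length windows.

    Each segment carries the cumulative history observed so far,
    reflecting the idea that a model is queried at successive points
    with only the information available up to that point, while labels
    come from a later hindsight review of the full trajectory.
    """
    if not events:
        return []
    events = sorted(events, key=lambda e: e.t)
    last_t = events[-1].t
    segments, history = [], []
    i, end = 0, window_hours
    while end - window_hours <= last_t:
        while i < len(events) and events[i].t < end:
            history.append(events[i])
            i += 1
        segments.append({
            "cutoff_hours": end,
            "context": list(history),
            # Labels would be filled in by physician hindsight review.
            "tasks": {task: None for task in TASKS},
        })
        end += window_hours
    return segments

events = [
    Event(1.0, "vitals", {"hr": 92}),
    Event(5.5, "lab", {"lactate": 2.1}),
    Event(7.0, "med", {"drug": "norepinephrine"}),
]
segs = segment_trajectory(events, window_hours=6.0)
print(len(segs), [s["cutoff_hours"] for s in segs])  # → 2 [6.0, 12.0]
```

The cumulative-context choice here is one plausible design; a benchmark could equally evaluate on disjoint windows, but querying at successive cutoffs matches the paper's framing of assessing reasoning at realistic decision points.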
Entities
Institutions
- arXiv