HealthCraft: RL Safety Environment for Emergency Medicine
HealthCraft is a new reinforcement-learning (RL) environment for evaluating frontier language models in emergency medicine. Adapted from Corecraft, it is the first public RL environment that rewards trajectory-level safety in realistic emergency-medicine scenarios. It runs on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and uses a dual-layer rubric that zeroes the reward whenever a safety-critical criterion is violated. The release includes 195 tasks across six categories, graded against 2,255 binary criteria (515 of them safety-critical), plus a post-hoc 10-task negative-class slate that brings the totals to 205 tasks and 2,337 criteria. V8 results are reported for two frontier models, including Claude Opus 4.6, though performance levels are not specified here.
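The safety gate in the dual-layer rubric can be illustrated with a minimal Python sketch. It is not HealthCraft's grader: the `Criterion` structure, the `trajectory_reward` function, the criterion names, and the choice to average the binary criteria when the gate passes are all assumptions made for illustration. The only behavior taken from the description is that a violated safety-critical criterion zeroes the reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One binary rubric item (hypothetical structure, not HealthCraft's schema)."""
    name: str
    satisfied: bool
    safety_critical: bool = False


def trajectory_reward(criteria: list[Criterion]) -> float:
    """Return 0.0 if any safety-critical criterion fails; otherwise the
    fraction of criteria satisfied (the averaging rule is an assumption)."""
    if not criteria:
        return 0.0
    # Layer 1: safety gate. One violated safety-critical item zeroes the reward.
    if any(c.safety_critical and not c.satisfied for c in criteria):
        return 0.0
    # Layer 2: assumed aggregate over all binary criteria.
    return sum(c.satisfied for c in criteria) / len(criteria)


# Hypothetical criteria for two trajectories on the same task.
safe_run = [
    Criterion("orders_ecg", True),
    Criterion("documents_allergies", False),
    Criterion("avoids_contraindicated_drug", True, safety_critical=True),
]
unsafe_run = [
    Criterion("orders_ecg", True),
    Criterion("documents_allergies", True),
    Criterion("avoids_contraindicated_drug", False, safety_critical=True),
]

print(trajectory_reward(safe_run))    # gate passes: 2 of 3 criteria satisfied
print(trajectory_reward(unsafe_run))  # gate fails: reward is zeroed
```

Under these assumptions, the unsafe trajectory scores zero even though it satisfies as many criteria as the safe one, which is the property the gate is meant to enforce.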
Key facts
- HealthCraft is the first public RL environment for trajectory-level safety in emergency medicine
- Adapted from Corecraft
- Built on FHIR R4 world state with 14 entity types and 3,987 seed entities
- Exposes 24 MCP tools
- Dual-layer rubric zeroes reward when safety-critical criteria are violated
- 195 tasks across six categories, graded against 2,255 binary criteria (515 safety-critical)
- Post-hoc 10-task negative-class slate extends to 205 tasks and 2,337 criteria
- V8 results reported on two frontier models, including Claude Opus 4.6
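The task and criterion counts above imply a few derived figures, which the short Python check below works out. The variable names are illustrative; the inputs are the totals stated in the key facts.

```python
# Totals as stated in the key facts.
base_tasks, base_criteria, safety_critical = 195, 2255, 515
total_tasks, total_criteria = 205, 2337

slate_tasks = total_tasks - base_tasks           # tasks in the negative-class slate
slate_criteria = total_criteria - base_criteria  # criteria added by the slate
safety_share = safety_critical / base_criteria   # safety-critical share of the base criteria

print(slate_tasks, slate_criteria, f"{safety_share:.1%}")  # 10 82 22.8%
```

The negative-class slate therefore adds 82 criteria across its 10 tasks, and roughly 23% of the original 2,255 criteria are safety-critical.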
Entities
Institutions
- arXiv
- Corecraft