Medmarks: Open-Source LLM Benchmark Suite for Medical Tasks

ai-technology · 2026-05-06

Medmarks is a newly released, fully open-source evaluation suite for assessing large language models (LLMs) on medical tasks. It was developed to address problems such as benchmark saturation and limited data access, and it comprises 30 benchmarks covering question answering, information extraction, medical calculations, and clinical reasoning. The evaluation covered 61 models across 71 configurations, scored with established metrics. Frontier reasoning models such as Gemini 3 Pro Preview, GPT-5.1, and GPT-5.2 achieved the highest performance, and frontier proprietary models were more token-efficient than open-weight alternatives. Medically fine-tuned models also outperformed their generalist counterparts.

Key facts

Medmarks is a fully open-source evaluation suite for LLMs in medical tasks.
It includes 30 benchmarks covering QA, information extraction, medical calculations, and clinical reasoning.
61 models across 71 configurations were evaluated.
Frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, GPT-5.2) achieved the highest performance.
Frontier proprietary models are more token-efficient than open-weight alternatives.
Medically fine-tuned models outperform their generalist counterparts.
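
As a rough illustration of how per-benchmark results could be rolled up into one score per configuration and a token-efficiency figure, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `RunResult` fields, the `summarize` helper, and the efficiency definition (mean accuracy per 1,000 output tokens) are hypothetical and are not taken from the Medmarks codebase or its actual metrics. The sample values are placeholders, not reported results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunResult:
    """One (configuration, benchmark) result. All fields are hypothetical."""
    config: str          # a model plus its settings, e.g. a reasoning mode
    benchmark: str       # benchmark name
    accuracy: float      # fraction of items answered correctly, in [0, 1]
    output_tokens: int   # tokens generated while answering this benchmark


def summarize(results):
    """Group results by configuration and compute simple aggregates.

    Returns a dict mapping each configuration to its mean accuracy across
    benchmarks, its total output tokens, and an assumed efficiency measure:
    mean accuracy per 1,000 output tokens.
    """
    by_config = {}
    for r in results:
        by_config.setdefault(r.config, []).append(r)

    summary = {}
    for config, runs in by_config.items():
        mean_acc = mean(r.accuracy for r in runs)
        total_tokens = sum(r.output_tokens for r in runs)
        per_1k = mean_acc / (total_tokens / 1000) if total_tokens else 0.0
        summary[config] = {
            "mean_accuracy": mean_acc,
            "total_tokens": total_tokens,
            "accuracy_per_1k_tokens": per_1k,
        }
    return summary


# Placeholder data only: two made-up configurations on two made-up benchmarks.
example = [
    RunResult("config-a", "bench-1", 0.80, 1000),
    RunResult("config-a", "bench-2", 0.60, 1000),
    RunResult("config-b", "bench-1", 0.90, 4000),
    RunResult("config-b", "bench-2", 0.70, 4000),
]

if __name__ == "__main__":
    for name, stats in summarize(example).items():
        print(name, stats)
```

Under this assumed definition, a configuration can score higher on accuracy while still being less token-efficient if it generates many more tokens to get there, which is the kind of trade-off the token-efficiency finding describes.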

Sources

arXiv cs.AI — 2026-05-05