MedStruct-S Benchmark Targets OCR Clinical Report Extraction

other · 2026-05-07

Researchers have introduced MedStruct-S, a benchmark for semi-structured information extraction from clinical reports digitized with OCR. It covers three tasks: field-header discovery, key-conditioned question answering, and end-to-end key-value pair extraction. The benchmark comprises 3,582 annotated pages from real-world clinical reports and evaluates models under unknown key representations and OCR noise. Two paradigms are compared: encoder-only sequence labeling followed by post-processing, and decoder-only structured generation. The evaluation covers four encoder-only models and five decoder-only models.

Key facts

MedStruct-S is a benchmark for semi-structured information extraction from OCR clinical reports.
It covers three tasks: field-header discovery, key-conditioned QA, and end-to-end key-value pair extraction.
The benchmark contains 3,582 annotated real-world clinical report pages.
It evaluates models under unknown key representations and OCR noise.
Two paradigms are benchmarked: encoder-only sequence labeling and decoder-only structured generation.
Four encoder-only and five decoder-only models are covered.
The research is published on arXiv with ID 2605.03103.
The goal is to reconstruct patients' longitudinal medical histories.
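
To make the three tasks concrete, here is a toy sketch of their input and output shapes. It is not the benchmark's data format or any of the evaluated models: the sample lines, the colon-based rule, and the function names (extract_pairs, discover_headers, answer_for_key) are all assumptions made up for illustration.

```python
import re

# Hypothetical OCR output for one report page (invented for illustration).
OCR_LINES = [
    "Patient Name: Jane Doe",
    "Date of Exam : 2024-01-15",
    "Diagnosis: mild anemia",
    "Follow-up in 3 months",
]

# Toy rule: a key-value line is "<key> : <value>", split at the first colon.
KV_PATTERN = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def extract_pairs(lines):
    """End-to-end key-value extraction: map each detected key to its value."""
    pairs = {}
    for line in lines:
        match = KV_PATTERN.match(line)
        if match:
            pairs[match.group(1)] = match.group(2)
    return pairs


def discover_headers(lines):
    """Field-header discovery: list the keys present, in reading order."""
    return list(extract_pairs(lines))


def answer_for_key(lines, key):
    """Key-conditioned QA: return the value for a given key, or None."""
    wanted = key.strip().lower()
    for found_key, value in extract_pairs(lines).items():
        if found_key.lower() == wanted:
            return value
    return None


if __name__ == "__main__":
    print(discover_headers(OCR_LINES))
    print(answer_for_key(OCR_LINES, "diagnosis"))
    print(extract_pairs(OCR_LINES))
```

A fixed rule like this breaks as soon as a key is spelled in an unseen way or OCR drops the colon, which is the kind of condition (unknown key representations and OCR noise) the benchmark is described as testing.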
