MedStruct-S Benchmark Targets OCR Clinical Report Extraction
Researchers have introduced MedStruct-S, a benchmark for semi-structured information extraction from OCR-processed clinical reports. It covers three tasks: field-header discovery, key-conditioned question answering, and end-to-end key-value pair extraction. The benchmark comprises 3,582 annotated pages from real-world clinical reports and evaluates models under unknown key representations and OCR noise. Two paradigms are compared: encoder-only sequence labeling followed by post-processing, and decoder-only structured generation. The evaluation covers four encoder-only models and five decoder-only models.
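To make the encoder-only paradigm concrete, the sketch below shows one way the post-processing step could turn per-token labels into key-value pairs. It is an illustration only, not the benchmark's implementation: the BIO tag set (`B-KEY`, `I-KEY`, `B-VALUE`, `I-VALUE`, `O`), the function name `decode_key_value_pairs`, and the rule of pairing each key with the next value span are all assumptions made for this example.

```python
from typing import List, Optional, Tuple


def decode_key_value_pairs(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into spans, then pair each KEY span
    with the VALUE span that follows it (hypothetical pairing rule)."""
    spans: List[Tuple[str, str]] = []  # (label, text) in reading order
    current_label: Optional[str] = None
    current_tokens: List[str] = []

    def close_span() -> None:
        # Flush the open span, if any, into the span list.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_span()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes the open span;
            # the token itself is not attached to any span.
            close_span()
            current_label, current_tokens = None, []
    close_span()

    pairs: List[Tuple[str, str]] = []
    pending_key: Optional[str] = None
    for label, text in spans:
        if label == "KEY":
            if pending_key is not None:
                pairs.append((pending_key, ""))  # key with no value found
            pending_key = text
        elif label == "VALUE" and pending_key is not None:
            pairs.append((pending_key, text))
            pending_key = None
    if pending_key is not None:
        pairs.append((pending_key, ""))
    return pairs


# Toy example with made-up tokens and labels.
example_tokens = ["Hemoglobin", ":", "13.5", "g/dL", "WBC", "6.1"]
example_tags = ["B-KEY", "O", "B-VALUE", "I-VALUE", "B-KEY", "B-VALUE"]
example_pairs = decode_key_value_pairs(example_tokens, example_tags)
```

On the toy input, `example_pairs` holds one pair for "Hemoglobin" and one for "WBC". A value span with no preceding key is dropped in this sketch; a real pipeline would need its own policy for such cases.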
Key facts
- MedStruct-S is a benchmark for semi-structured information extraction from OCR clinical reports.
- It covers three tasks: field-header discovery, key-conditioned QA, and end-to-end key-value pair extraction.
- The benchmark contains 3,582 annotated real-world clinical report pages.
- It evaluates models under unknown key representations and OCR noise.
- Two paradigms are benchmarked: encoder-only sequence labeling and decoder-only structured generation.
- Four encoder-only and five decoder-only models are covered.
- The research is published on arXiv with ID 2605.03103.
- The goal is to reconstruct patients' longitudinal medical histories.
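For the decoder-only paradigm, a model emits structured text that has to be parsed and compared with gold annotations. The following is a minimal sketch under stated assumptions: the JSON object output format, the normalization rules, and the exact-match pair F1 are illustrative choices for this example and are not taken from the MedStruct-S paper; the helper names `normalize`, `parse_generated_pairs`, and `pair_f1` are hypothetical.

```python
import json
import unicodedata
from typing import Dict


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and collapse whitespace (assumed rules)."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def parse_generated_pairs(raw_output: str) -> Dict[str, str]:
    """Parse a JSON object emitted by a generative model.
    Malformed or non-object output yields an empty dict."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    return {normalize(str(k)): normalize(str(v)) for k, v in data.items()}


def pair_f1(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Exact-match F1 over normalized (key, value) pairs.
    `predicted` is expected to be normalized already."""
    pred_set = set(predicted.items())
    gold_set = {(normalize(k), normalize(v)) for k, v in gold.items()}
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up model output and gold labels.
raw = '{"Hemoglobin ": "13.5  g/dL", "WBC": "6.1"}'
gold = {"hemoglobin": "13.5 g/dL", "WBC": "6.2"}
score = pair_f1(parse_generated_pairs(raw), gold)
```

Normalizing both sides before matching is one simple way to tolerate spacing and casing differences introduced by OCR; it does not handle character-level recognition errors, which would need fuzzy matching.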
Entities
Institutions
- arXiv