ARTFEED — Contemporary Art Intelligence

MedStruct-S Benchmark Targets OCR Clinical Report Extraction

other · 2026-05-07

Researchers have introduced MedStruct-S, a benchmark for evaluating semi-structured information extraction from OCR-processed clinical reports. It covers three tasks: field-header discovery, key-conditioned question answering, and end-to-end key-value pair extraction. The benchmark comprises 3,582 annotated pages from real clinical reports and evaluates models under unknown key representations and OCR noise. Two paradigms are compared: encoder-only sequence labeling followed by post-processing, and decoder-only structured generation. The evaluation covers four encoder-only and five decoder-only models.
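To make the encoder-only paradigm concrete, the following is a minimal sketch of the post-processing step that could turn per-token labels into key-value pairs. The BIO scheme, the `KEY` and `VALUE` label names, and the `decode_pairs` function are assumptions for illustration; this summary does not describe the label set or post-processing the benchmark actually uses.

```python
from typing import List, Tuple


def decode_pairs(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into spans, then pair each KEY span
    with the VALUE span that follows it (hypothetical label scheme)."""
    spans = []      # each entry is (label, [tokens in the span])
    current = None  # span being built, or None when outside any span
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            current = (label, [token])
            spans.append(current)
        elif prefix == "I" and current is not None and current[0] == label:
            current[1].append(token)
        else:
            current = None  # an "O" tag or an inconsistent "I-" tag closes the span

    pairs = []
    pending_key = None
    for label, words in spans:
        text = " ".join(words)
        if label == "KEY":
            if pending_key is not None:
                pairs.append((pending_key, ""))  # previous key had no value
            pending_key = text
        elif label == "VALUE" and pending_key is not None:
            pairs.append((pending_key, text))
            pending_key = None
    if pending_key is not None:
        pairs.append((pending_key, ""))
    return pairs


tokens = ["Hemoglobin", ":", "13.5", "g/dL", "WBC", "6.1"]
tags = ["B-KEY", "O", "B-VALUE", "I-VALUE", "B-KEY", "B-VALUE"]
print(decode_pairs(tokens, tags))
```

A VALUE span with no preceding key is dropped here, and a key with no value is kept with an empty string. Both are design choices of this sketch, not documented behavior of any benchmarked model.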

Key facts

  • MedStruct-S is a benchmark for semi-structured information extraction from OCR clinical reports.
  • It covers three tasks: field-header discovery, key-conditioned QA, and end-to-end key-value pair extraction.
  • The benchmark contains 3,582 annotated real-world clinical report pages.
  • It evaluates models under unknown key representations and OCR noise.
  • Two paradigms are benchmarked: encoder-only sequence labeling and decoder-only structured generation.
  • Four encoder-only and five decoder-only models are covered.
  • The research is published on arXiv with ID 2605.03103.
  • The goal is to reconstruct patients' longitudinal medical histories.
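As a rough illustration of the decoder-only paradigm under OCR noise, the sketch below parses a model's generated output and maps noisy keys onto known field names by string similarity. The JSON output format, the `known_keys` list, the 0.8 similarity cutoff, and both function names are assumptions made for this example; the summary does not say how the benchmark represents or matches keys.

```python
import json
from difflib import get_close_matches
from typing import Dict, List, Optional


def normalize_key(raw_key: str, known_keys: List[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the closest known key for an OCR-noisy key, or None if nothing is close enough."""
    matches = get_close_matches(raw_key.strip().lower(), known_keys, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def parse_generated_pairs(model_output: str, known_keys: List[str]) -> Dict[str, str]:
    """Parse a generated JSON object of key-value pairs (assumed format).

    Keys that resemble a known key are normalized; unknown keys are kept
    as generated, since unseen key representations are expected.
    """
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return {}  # malformed generation yields no pairs
    if not isinstance(data, dict):
        return {}
    result = {}
    for raw_key, value in data.items():
        key = normalize_key(raw_key, known_keys) or raw_key
        result[key] = str(value)
    return result


known = ["hemoglobin", "white blood cell count"]
print(parse_generated_pairs('{"Hemog1obin": "13.5 g/dL", "Ward": "B2"}', known))
```

Here the OCR-damaged key `Hemog1obin` is mapped to `hemoglobin`, while `Ward`, which matches no known key, is passed through unchanged.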

Entities

Institutions

  • arXiv
