SHIELD Dataset and Distilled Small Language Models for Clinical Text De-identification
Researchers have released SHIELD (Synthetic Human-annotated Identifier-replaced Entries for Learning and De-identification), a dataset of 1,394 clinical notes containing 10,505 gold-standard spans of Protected Health Information (PHI) across nine categories. The dataset was built with set-cover diversity sampling and human-in-the-loop adjudication to address the limited semantic and demographic diversity of earlier benchmarks such as i2b2 2006/2014. The study evaluated four Large Language Models (two proprietary, two open-weight) to establish a performance ceiling, then distilled those capabilities into Small Language Models (SLMs) suited to local deployment, sidestepping the compute costs and governance rules that keep PHI off cloud APIs. Fréchet distance was used for distributional analysis of the dataset's representativeness.
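The paper does not spell out the sampling algorithm, but set-cover diversity sampling is typically a greedy procedure: repeatedly pick the note that covers the most not-yet-covered attributes. A minimal sketch, assuming each candidate note has been tagged with a hypothetical `attrs` set (e.g., PHI categories or demographic labels) and that selection stops at a fixed budget:

```python
def greedy_set_cover_sample(notes, budget):
    """Greedy set-cover sampling sketch (not the authors' exact method).

    notes:  list of dicts, each with an "attrs" set of coverage labels
            (the tagging scheme here is an assumption for illustration).
    budget: maximum number of notes to select.
    Returns the selected notes and the set of attributes they cover.
    """
    covered = set()
    selected = []
    remaining = list(notes)
    while remaining and len(selected) < budget:
        # Pick the note contributing the most attributes not yet covered.
        best = max(remaining, key=lambda n: len(n["attrs"] - covered))
        if not (best["attrs"] - covered):
            break  # no remaining note adds new coverage
        selected.append(best)
        covered |= best["attrs"]
        remaining.remove(best)
    return selected, covered
```

Human-in-the-loop adjudication would then review the selected notes before their gold-standard PHI spans are finalized.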
Key facts
- SHIELD dataset contains 1,394 notes and 10,505 PHI spans across 9 categories.
- Built via set-cover diversity sampling with human-in-the-loop adjudication.
- Addresses lack of diversity in older benchmarks like i2b2 2006/2014.
- Evaluated four LLMs (two proprietary, two open-weight) to establish a performance ceiling.
- Distilled LLM capabilities into locally deployable Small Language Models (SLMs).
- Enterprise deployment hindered by compute costs and governance restricting PHI from cloud APIs.
- Distributional analysis using Fréchet distance was performed.
- Published on arXiv under identifier 2605.03301.
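The summary does not say how Fréchet distance was computed, but for distributional analysis of text corpora it is commonly the Fréchet distance between Gaussian fits of embedding sets (as in FID). A sketch under that assumption, where `x` and `y` are hypothetical note-embedding matrices:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    """Squared Fréchet distance between Gaussians fit to two embedding sets.

    x, y: arrays of shape (n_samples, dim), e.g., sentence embeddings of
    clinical notes (the embedding source is an assumption here).
    """
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1 = np.cov(x, rowvar=False)
    s2 = np.cov(y, rowvar=False)
    # Matrix square root of the covariance product; discard tiny
    # imaginary components introduced by numerical error.
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))
```

A distance near zero indicates the sampled dataset's embedding distribution closely matches the reference corpus; larger values flag a representativeness gap.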
Entities
Institutions
- arXiv