PIIBench: New Benchmark Corpus for Personally Identifiable Information Detection in Text
PIIBench is a new benchmark corpus for detecting Personally Identifiable Information (PII) in natural language text. It consolidates ten publicly available datasets into 2,369,883 annotated sequences containing 3.35 million entity mentions across 48 canonical PII entity types. The source datasets span synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and annotated financial-domain text. Until now, PII detection resources have been fragmented across domain-specific corpora with incompatible annotation schemes, which made systematic comparison of detection systems difficult. To address this, a normalization pipeline maps more than 80 source-specific label variants to a unified BIO tagging scheme, suppresses low-frequency entity types, and produces stratified 80/10/10 train/validation/test splits that preserve the source distribution. Eight detection systems were evaluated on the corpus to establish baseline difficulty.
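The label normalization and frequency-based suppression steps can be illustrated with a minimal Python sketch. This is not the PIIBench implementation. The mapping table, the function names, the threshold, and the choice to rewrite unmapped or rare labels as "O" are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mapping from source-specific labels to canonical PII types.
# The real pipeline is described as covering 80+ variants; these are made up.
LABEL_MAP = {
    "PER": "PERSON",
    "PERSON_NAME": "PERSON",
    "EMAIL_ADDRESS": "EMAIL",
    "E-MAIL": "EMAIL",
    "TEL": "PHONE",
}


def normalize_tag(tag, label_map):
    """Map one source BIO tag to its canonical form.

    Assumption: tags with an unknown label or a malformed prefix become "O".
    """
    if tag == "O":
        return "O"
    # Split on the first hyphen only, so labels such as "E-MAIL" stay intact.
    prefix, _, label = tag.partition("-")
    if prefix not in ("B", "I") or not label:
        return "O"
    canonical = label_map.get(label.upper())
    return f"{prefix}-{canonical}" if canonical else "O"


def suppress_rare(sequences, min_mentions):
    """Replace entity types with fewer than min_mentions mentions by "O".

    A mention is counted once per "B-" tag.
    """
    counts = Counter(
        tag[2:] for seq in sequences for tag in seq if tag.startswith("B-")
    )
    keep = {label for label, n in counts.items() if n >= min_mentions}
    return [
        [tag if tag == "O" or tag[2:] in keep else "O" for tag in seq]
        for seq in sequences
    ]


raw = [
    ["B-PER", "I-PER", "O"],
    ["B-EMAIL_ADDRESS", "O"],
    ["B-PER", "O", "B-TEL"],
]
normalized = [[normalize_tag(t, LABEL_MAP) for t in seq] for seq in raw]
filtered = suppress_rare(normalized, min_mentions=2)
```

In this toy input only PERSON reaches two mentions, so the EMAIL and PHONE tags are rewritten to "O" after suppression. A real pipeline might instead drop such sequences or merge rare types into a broader category.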
Key facts
- PIIBench is a unified benchmark corpus for Personally Identifiable Information detection
- The corpus contains 2,369,883 annotated sequences
- There are 3.35 million entity mentions across 48 canonical PII entity types
- Ten publicly available datasets were consolidated
- Datasets include synthetic PII corpora, multilingual NER benchmarks, and financial domain text
- A normalization pipeline maps 80+ source-specific label variants to standardized BIO tagging
- Stratified 80/10/10 train/validation/test splits preserve source distribution
- Eight detection systems were evaluated to establish baseline difficulty
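One way to produce 80/10/10 splits that preserve the source distribution is to shuffle and split each source dataset separately, then concatenate the parts. The sketch below assumes each record carries a hypothetical "source" field. The function name, the record layout, and the rounding behaviour are illustrative and are not taken from PIIBench.

```python
import random
from collections import defaultdict


def stratified_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/validation/test, stratified by rec["source"].

    Each source is shuffled and split independently, so every split keeps
    roughly the same mix of sources as the full corpus.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)

    splits = {"train": [], "validation": [], "test": []}
    for source in sorted(by_source):  # sorted for a reproducible order
        group = by_source[source]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        # The test split takes whatever remains, so no record is lost.
        splits["test"].extend(group[n_train + n_val:])
    return splits


# Toy corpus: 100 records from one source and 50 from another.
records = [{"id": i, "source": "synthetic"} for i in range(100)]
records += [{"id": 100 + i, "source": "finance"} for i in range(50)]
splits = stratified_split(records)
```

With this toy corpus the train split holds 80 "synthetic" and 40 "finance" records, the same 2:1 ratio as the input. For very small sources the rounding can leave a split empty, which a production pipeline would need to handle explicitly.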