ARTFEED — Contemporary Art Intelligence

PIIBench: New Benchmark Corpus for Personally Identifiable Information Detection in Text

publication · 2026-04-20

A new benchmark corpus, PIIBench, has been released to support the detection of Personally Identifiable Information (PII) in natural language text. It consolidates ten publicly available datasets into 2,369,883 annotated sequences containing 3.35 million entity mentions across 48 canonical PII entity types. Until now, PII detection resources were scattered across domain-specific corpora with differing annotation schemes, which hindered systematic comparison of detection systems. The source datasets include synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and annotated financial-domain text. A normalization pipeline maps more than 80 source-specific label variants to a unified BIO tagging scheme, suppresses infrequent entity types based on their frequency, and generates stratified 80/10/10 train/validation/test splits that preserve the source distribution. Eight detection systems were evaluated on the corpus to establish baseline difficulty.
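
The label normalization and frequency-based suppression described above can be illustrated with a minimal sketch. This is not the PIIBench pipeline itself: the `LABEL_MAP` entries, the function names, the `min_count` threshold, and the choice to turn unmapped or rare labels into `O` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mapping from source-specific labels to canonical PII types.
# The real corpus maps 80+ variants onto 48 types; these entries are invented.
LABEL_MAP = {
    "PER": "PERSON",
    "person_name": "PERSON",
    "email": "EMAIL",
    "EMAIL_ADDRESS": "EMAIL",
    "IBAN_CODE": "IBAN",
}

def normalize_tag(tag):
    """Map a source BIO tag such as 'B-PER' to a canonical tag such as 'B-PERSON'."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    canonical = LABEL_MAP.get(label)
    # Assumption: malformed prefixes and unmapped labels fall back to 'O'.
    if prefix not in ("B", "I") or canonical is None:
        return "O"
    return f"{prefix}-{canonical}"

def suppress_rare(sequences, min_count):
    """Replace tags of entity types with fewer than min_count mentions by 'O'."""
    # Each 'B-' tag opens exactly one mention, so counting them counts mentions.
    counts = Counter(
        tag[2:] for seq in sequences for tag in seq if tag.startswith("B-")
    )
    keep = {label for label, n in counts.items() if n >= min_count}
    return [
        [tag if tag == "O" or tag[2:] in keep else "O" for tag in seq]
        for seq in sequences
    ]

raw = [["B-PER", "I-PER", "O"], ["B-email", "O"], ["B-person_name", "O"]]
normalized = [[normalize_tag(t) for t in seq] for seq in raw]
filtered = suppress_rare(normalized, min_count=2)
```

In this toy input, `PERSON` has two mentions and survives the threshold, while the single `EMAIL` mention is suppressed to `O`.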

Key facts

  • PIIBench is a unified benchmark corpus for Personally Identifiable Information detection
  • The corpus contains 2,369,883 annotated sequences
  • There are 3.35 million entity mentions across 48 canonical PII entity types
  • Ten publicly available datasets were consolidated
  • Datasets include synthetic PII corpora, multilingual NER benchmarks, and financial domain text
  • A normalization pipeline maps 80+ source-specific label variants to standardized BIO tagging
  • Stratified 80/10/10 train/validation/test splits preserve source distribution
  • Eight detection systems were evaluated to establish baseline difficulty
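
A stratified 80/10/10 split that preserves source distribution can be sketched as follows. This is an illustrative assumption about how such a split might be built, not the released implementation: the `source` field name, the `stratified_split` function, and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/validation/test, stratified by record['source'].

    Each source dataset is shuffled and cut separately, so every split keeps
    roughly the same mix of sources as the full corpus.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)

    splits = {"train": [], "validation": [], "test": []}
    for source in sorted(by_source):  # sorted for reproducible ordering
        group = by_source[source]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        # The test split takes the remainder, so no record is dropped.
        splits["test"].extend(group[n_train + n_val:])
    return splits

# Toy corpus: 50 records from one hypothetical source, 20 from another.
records = [{"source": "synthetic", "id": i} for i in range(50)]
records += [{"source": "financial", "id": i} for i in range(20)]
splits = stratified_split(records)
```

Splitting per source rather than over the pooled corpus keeps a small dataset from being over- or under-represented in the validation and test sets by chance.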
