PIIBench: New Benchmark Corpus for Personally Identifiable Information Detection in Text

publication · 2026-04-20

PIIBench is a new benchmark corpus for detecting Personally Identifiable Information (PII) in natural language text. It consolidates ten publicly available datasets into 2,369,883 annotated sequences containing 3.35 million entity mentions across 48 canonical PII types. Until now, PII detection resources were scattered across domain-specific corpora with differing annotation schemes, which hindered systematic comparison of detection systems; PIIBench is intended to address that fragmentation. The source datasets include synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and annotated financial-domain text. A normalization pipeline maps more than 80 source-specific label variants to a unified BIO tagging scheme, suppresses infrequent entity types based on their frequency, and generates stratified 80/10/10 train/validation/test splits that preserve the source distribution. Eight detection systems were evaluated to establish baseline difficulty.
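The label normalization and frequency-based suppression steps could look roughly like the sketch below. It is illustrative only and is not the PIIBench pipeline: the label names in LABEL_MAP, the min_count threshold, and all function names are hypothetical, and the summary does not say how unmapped labels are handled (here they fall back to "O").

```python
from collections import Counter

# Hypothetical mapping from source-specific labels to canonical types.
# The real corpus maps 80+ variants onto 48 types; these few are made up.
LABEL_MAP = {
    "PER": "PERSON",
    "PERSON_NAME": "PERSON",
    "EMAIL_ADDRESS": "EMAIL",
    "E-MAIL": "EMAIL",
    "RARE_ID": "RARE_ID",
}


def normalize_tag(tag, label_map):
    """Map one BIO tag such as 'B-PER' to its canonical form, e.g. 'B-PERSON'."""
    if tag == "O":
        return "O"
    # Split on the first hyphen only, so labels like 'E-MAIL' stay intact.
    prefix, _, label = tag.partition("-")
    canonical = label_map.get(label)
    # Assumption: labels missing from the map are treated as non-entities.
    return f"{prefix}-{canonical}" if canonical else "O"


def suppress_rare(sequences, min_count):
    """Replace tags of entity types with fewer than min_count mentions by 'O'."""
    # Each mention starts with exactly one B- tag, so counting B- tags counts mentions.
    counts = Counter(
        tag[2:] for seq in sequences for tag in seq if tag.startswith("B-")
    )
    keep = {label for label, n in counts.items() if n >= min_count}
    return [
        [tag if tag == "O" or tag[2:] in keep else "O" for tag in seq]
        for seq in sequences
    ]


def normalize_corpus(sequences, label_map=LABEL_MAP, min_count=2):
    """Normalize every tag, then suppress infrequent entity types."""
    normalized = [[normalize_tag(t, label_map) for t in seq] for seq in sequences]
    return suppress_rare(normalized, min_count)
```

Running the suppression after normalization means counts are taken over canonical types, so variants that are rare individually still survive when their merged type is frequent enough.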

Key facts

PIIBench is a unified benchmark corpus for Personally Identifiable Information detection
The corpus contains 2,369,883 annotated sequences
There are 3.35 million entity mentions across 48 canonical PII entity types
Ten publicly available datasets were consolidated
Datasets include synthetic PII corpora, multilingual NER benchmarks, and financial domain text
A normalization pipeline maps 80+ source-specific label variants to standardized BIO tagging
Stratified 80/10/10 train/validation/test splits preserve source distribution
Eight detection systems were evaluated to establish baseline difficulty
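One way to produce an 80/10/10 split that preserves source distribution is to split each source dataset separately and then pool the parts. The sketch below assumes each record is a dict with a "source" key; that record layout, the function name, and the seed are assumptions for illustration, not details taken from the publication.

```python
import random
from collections import defaultdict


def stratified_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/validation/test, stratified on record['source']."""
    # Group records by their originating dataset.
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "validation": [], "test": []}
    for source in sorted(by_source):
        group = by_source[source]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        # The test split takes whatever remains after train and validation.
        splits["test"].extend(group[n_train + n_val:])
    return splits
```

Because every source is cut with the same ratios, each split keeps approximately the same mix of sources as the full corpus, up to rounding on small sources.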

Entities

—

Sources

arXiv cs.AI — 2026-04-20