FakeWiki Benchmark for Language Model Provenance
A new arXiv paper introduces DataDignity, a framework for training data attribution in large language models. The authors propose pinpoint provenance, the task of identifying which source document supports a given model response. To evaluate it, they create FakeWiki, a benchmark of 3,537 fabricated Wikipedia-style articles with ground-truth provenance. The benchmark includes QA probes, paraphrases, retro-generated variants, and hard anti-documents. Five query conditions are tested: clean prompting and four jailbreak-inspired transformations. The study evaluates seven retrieval baselines, SteerFuse, a training-free activation-steering retrieval-fusion method, and ScoringModel, a supervised contrastive provenance ranker.
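Pinpoint provenance can be pictured as a ranking problem: score every candidate training document against the model's response and return them best-supported first. Below is a minimal sketch assuming a plain embedding-similarity scorer; the model name and the `rank_provenance` helper are illustrative assumptions, not the paper's baselines.

```python
# Sketch: pinpoint provenance framed as document ranking via embedding
# similarity. This is NOT the paper's method, just one simple instance.
from sentence_transformers import SentenceTransformer, util

def rank_provenance(response: str, documents: list[str],
                    model: SentenceTransformer) -> list[int]:
    """Return candidate document indices, best-supported first."""
    resp_emb = model.encode(response, convert_to_tensor=True)
    doc_embs = model.encode(documents, convert_to_tensor=True)
    scores = util.cos_sim(resp_emb, doc_embs)[0]   # one score per document
    return scores.argsort(descending=True).tolist()

model = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative model choice
docs = ["Fabricated article about topic A.", "Fabricated article about topic B."]
ranking = rank_provenance("The model's answer about topic B.", docs, model)
print(ranking)  # e.g. [1, 0]: document 1 ranked as the likelier source
```

With ground-truth provenance available, as in FakeWiki, a ranking like this can be scored directly with recall@1 or mean reciprocal rank.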
Key facts
- DataDignity addresses training data attribution for LLMs.
- Pinpoint provenance ranks documents supporting a model response.
- FakeWiki contains 3,537 fabricated Wikipedia-style articles.
- FakeWiki includes QA probes, paraphrases, retro-generated variants, and hard anti-documents (a record sketch follows this list).
- Five query conditions: clean prompting and four jailbreak-inspired transformations.
- Seven retrieval baselines evaluated.
- SteerFuse is a training-free activation-steering retrieval-fusion method (a generic score-fusion sketch follows this list).
- ScoringModel is a supervised contrastive provenance ranker (a training-loss sketch follows this list).
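The per-article structure of FakeWiki can be pictured as a record bundling the fabricated article with its probes and distractors. The field names below are assumptions for illustration, not the released schema:

```python
from dataclasses import dataclass

@dataclass
class FakeWikiRecord:
    """Hypothetical shape of one FakeWiki entry; field names are assumed."""
    article_id: str
    article_text: str                  # the fabricated Wikipedia-style article
    qa_probes: list[tuple[str, str]]   # (question, answer) pairs probing recall
    paraphrases: list[str]             # reworded variants of the article
    retro_variants: list[str]          # retro-generated variants
    anti_documents: list[str]          # hard negatives: similar but non-supporting
```

Because the articles are fabricated, the ground-truth source of any probed fact is the record's own article, which is what makes exact provenance evaluation possible.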
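The summary describes SteerFuse only as training-free activation-steering retrieval-fusion. One generic way to picture the fusion half is a weighted combination of a retrieval score with an activation-derived score; the sketch below shows only that combination and assumes nothing about how the paper actually derives its steering signal:

```python
import numpy as np

def fuse_scores(retrieval_scores: np.ndarray,
                steering_scores: np.ndarray,
                alpha: float = 0.5) -> np.ndarray:
    """Generic score fusion: min-max normalize both signals, then blend.

    How SteerFuse derives and combines its activation signal is an
    assumption here; this shows only the fusion pattern.
    """
    def norm(x: np.ndarray) -> np.ndarray:
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    return alpha * norm(retrieval_scores) + (1 - alpha) * norm(steering_scores)
```

A fusion like this stays training-free: the only parameter is the blend weight, which can be fixed or tuned without gradient updates.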
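A supervised contrastive ranker is typically trained to score the true source document above hard negatives, and FakeWiki's anti-documents are natural hard negatives. A minimal InfoNCE-style sketch in PyTorch follows; the loss form is an assumption about what "supervised contrastive" means here, not the paper's stated objective:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(resp_emb: torch.Tensor,
                     pos_emb: torch.Tensor,
                     neg_embs: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss pushing the true source above hard negatives.

    resp_emb: (d,)   response embedding
    pos_emb:  (d,)   ground-truth source document embedding
    neg_embs: (k, d) hard-negative embeddings (e.g. anti-documents)
    """
    resp = F.normalize(resp_emb, dim=-1)
    docs = F.normalize(torch.cat([pos_emb.unsqueeze(0), neg_embs]), dim=-1)
    logits = docs @ resp / temperature          # (k+1,) similarity scores
    target = torch.zeros(1, dtype=torch.long)   # positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)

# Usage with dummy embeddings (dimension 384 is arbitrary):
loss = contrastive_loss(torch.randn(384), torch.randn(384), torch.randn(4, 384))
```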