PoisonForge Benchmark Reveals LLM Vulnerability to Task-Level Data Poisoning
A new benchmark called PoisonForge shows how easily instruction-tuned large language models (LLMs) can be manipulated through task-level data poisoning. The researchers parameterize the threat along four dimensions: bias type, poisoning mode, appearance count, and target output length. They evaluate 12 open-weight models, ranging from 2B to 32B parameters across five model families, mostly at a 1% poison budget. With only 10 poisoned examples among 1,000 fine-tuning examples, 11 of the 12 models exceed a 70% attack success rate (ASR) in their most susceptible settings. Unintended leakage to non-target tasks stays below 0.5%, and the poisoned models maintain their performance on standard benchmarks, so the attack is hard to notice through ordinary evaluation. The findings point to a significant security vulnerability in the data supply chain for fine-tuning LLMs.
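To make the poison budget concrete, the following is a minimal sketch of the arithmetic, not the benchmark's own code. It assumes the budget is the fraction of a fixed-size fine-tuning set that is swapped for poisoned examples; the `Example` type and the `poison_count` and `mix_poison` helpers are hypothetical names introduced here for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    # Hypothetical record for one fine-tuning example.
    task: str
    prompt: str
    response: str
    poisoned: bool = False


def poison_count(total_examples: int, budget: float) -> int:
    """Number of poisoned examples implied by a budget given as a fraction."""
    return round(total_examples * budget)


def mix_poison(clean, poisoned, budget, seed=0):
    """Swap a `budget` fraction of the clean set for poisoned examples.

    Assumption: the total set size stays fixed, so poisoned examples
    replace clean ones rather than being appended.
    """
    k = poison_count(len(clean), budget)
    if k > len(poisoned):
        raise ValueError("not enough poisoned examples for this budget")
    rng = random.Random(seed)
    kept = rng.sample(clean, len(clean) - k)  # clean examples that survive
    mixed = kept + list(poisoned[:k])
    rng.shuffle(mixed)  # avoid clustering poison at the end of the set
    return mixed


if __name__ == "__main__":
    clean = [Example("summarize", f"prompt {i}", "response") for i in range(1000)]
    poisoned = [Example("summarize", f"prompt p{i}", "biased response", True)
                for i in range(10)]
    mixed = mix_poison(clean, poisoned, budget=0.01)
    print(len(mixed), sum(e.poisoned for e in mixed))
```

Under these assumptions, a 1% budget on 1,000 examples corresponds to the 10 poisoned examples cited above.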
Key facts
- PoisonForge is a benchmark for task-level targeted poisoning of instruction-tuned LLMs.
- The threat is parameterized along four dimensions: bias type, poisoning mode, appearance count, and target output length.
- 12 open-weight models from 2B to 32B parameters across five families were evaluated.
- Most experiments used a 1% poison budget.
- With 10 poisoned examples among 1,000 fine-tuning examples, 11 of 12 models exceeded a 70% attack success rate (ASR) in their most susceptible settings.
- Unintended leakage to non-target tasks is below 0.5%.
- Models maintain performance on standard benchmarks.
- The benchmark is introduced in arXiv paper 2605.23168.
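The two headline metrics, ASR on the target task and leakage to non-target tasks, can be sketched as simple rates. This is an illustrative sketch only: the source does not describe how the paper scores an output as a successful attack, so the case-insensitive substring check below is an assumption, and `rate`, `evaluate`, and the sample strings are hypothetical.

```python
def rate(outputs, is_hit):
    """Fraction of outputs for which is_hit(output) is true (0.0 if empty)."""
    if not outputs:
        return 0.0
    return sum(1 for out in outputs if is_hit(out)) / len(outputs)


def evaluate(outputs_by_task, target_task, target_text):
    """Return (asr, leakage) for model outputs grouped by task name.

    Assumption: an output counts as a hit when it contains the attacker's
    target text, ignoring case. The benchmark's real scoring may differ.
    """
    def is_hit(out):
        return target_text.lower() in out.lower()

    # ASR: hit rate on prompts from the targeted task.
    asr = rate(outputs_by_task.get(target_task, []), is_hit)

    # Leakage: hit rate on prompts from every other task.
    others = [out
              for task, outs in outputs_by_task.items()
              if task != target_task
              for out in outs]
    leakage = rate(others, is_hit)
    return asr, leakage


if __name__ == "__main__":
    outputs = {
        "summarize": ["Buy AcmeCola now", "a plain summary",
                      "acmecola is great", "AcmeCola!"],
        "translate": ["bonjour", "hola"],
    }
    print(evaluate(outputs, target_task="summarize", target_text="AcmeCola"))
```

Keeping the two rates separate reflects the point of a task-level attack: a high ASR combined with near-zero leakage means the bias appears only where the attacker intends it.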
Entities
Institutions
- arXiv