PoisonForge Benchmark Reveals LLM Vulnerability to Task-Level Data Poisoning

ai-technology · 2026-05-25

A new benchmark called PoisonForge measures how susceptible instruction-tuned large language models (LLMs) are to task-level data poisoning. The researchers parameterize the threat along four dimensions: bias type, poisoning mode, appearance count, and target output length. They evaluate 12 open-weight models, ranging from 2B to 32B parameters across five model families, mostly at a 1% poison budget. With only 10 poisoned examples among 1,000 fine-tuning examples, 11 of the 12 models exceed a 70% attack success rate (ASR) in their most susceptible settings. Leakage to non-target tasks stays below 0.5%, and the poisoned models maintain their performance on standard benchmarks. The findings point to a significant security vulnerability in the data supply chain for fine-tuning LLMs.
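To make the 1% poison budget concrete, here is a minimal Python sketch of how a fine-tuning set of that shape could be assembled. It is not taken from the PoisonForge paper: the function name `build_poisoned_set`, the record fields, and the choice to replace clean examples (so the total stays at 1,000) rather than append to them are all assumptions made for illustration.

```python
import random


def build_poisoned_set(clean, poisoned, budget=0.01, seed=0):
    """Swap a `budget` fraction of `clean` for examples drawn from `poisoned`.

    Hypothetical helper: the replace-not-append policy is an assumption,
    chosen so that the total dataset size is unchanged.
    """
    n_poison = round(len(clean) * budget)
    if n_poison > len(poisoned):
        raise ValueError("not enough poisoned examples for this budget")
    rng = random.Random(seed)
    # Keep a random subset of clean examples, then add the poisoned ones.
    kept = rng.sample(clean, len(clean) - n_poison)
    mixed = kept + rng.sample(poisoned, n_poison)
    rng.shuffle(mixed)
    return mixed, n_poison


# Placeholder records; real examples would be instruction/response pairs.
clean = [{"prompt": f"clean-{i}", "poisoned": False} for i in range(1000)]
poisoned = [{"prompt": f"poison-{i}", "poisoned": True} for i in range(50)]

mixed, n_poison = build_poisoned_set(clean, poisoned, budget=0.01)
print(n_poison, len(mixed))  # 10 1000
```

At this budget the poisoned records make up 10 of the 1,000 examples, which matches the ratio quoted in the summary.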

Key facts

PoisonForge is a benchmark for task-level targeted poisoning of instruction-tuned LLMs.
The threat is parameterized along four dimensions: bias type, poisoning mode, appearance count, and target output length.
12 open-weight models from 2B to 32B parameters across five families were evaluated.
Most experiments used a 1% poison budget.
With 10 poisoned examples among 1,000 fine-tuning examples, 11 of 12 models exceeded 70% ASR in their most susceptible settings.
Unintended leakage to non-target tasks is below 0.5%.
Models maintain performance on standard benchmarks.
The benchmark is introduced in arXiv paper 2605.23168.
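
The ASR and leakage figures above can be read as two rates over evaluation prompts. The sketch below assumes one plausible reading, which the summary does not spell out: ASR is the share of target-task prompts whose output shows the attacker's intended behavior, and leakage is the same share measured on non-target tasks. The names `attack_metrics`, `task`, and `hit` are hypothetical, and the toy records are invented for illustration, not results from the paper.

```python
def rate(flags):
    """Fraction of True values; 0.0 for an empty list."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def attack_metrics(results, target_task):
    """Compute ASR on the target task and leakage on all other tasks.

    `results` is a list of dicts with a 'task' label and a boolean 'hit'
    marking whether the output showed the attacker's intended behavior.
    """
    on_target = [r["hit"] for r in results if r["task"] == target_task]
    off_target = [r["hit"] for r in results if r["task"] != target_task]
    return {"asr": rate(on_target), "leakage": rate(off_target)}


# Toy evaluation records (made up for this example).
results = [
    {"task": "summarize", "hit": True},
    {"task": "summarize", "hit": True},
    {"task": "summarize", "hit": True},
    {"task": "summarize", "hit": False},
    {"task": "translate", "hit": False},
    {"task": "translate", "hit": False},
    {"task": "qa", "hit": False},
    {"task": "qa", "hit": False},
]

metrics = attack_metrics(results, target_task="summarize")
print(metrics)  # {'asr': 0.75, 'leakage': 0.0}
```

Reporting the two rates separately is what lets a benchmark distinguish a targeted attack (high ASR, near-zero leakage) from one that degrades the model across the board.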
