LLM Rewriting Defends Against Data Poisoning Attacks
A team of researchers proposes using large language model (LLM) rewriting as a preemptive defense against backdoor attacks (BAs) that stem from data poisoning. The approach, termed open-book benign rewriting (OBBR), maps training samples into a safe prompt space to raise the likelihood of benign outputs. In tests covering five known BAs and four widely used LLMs, OBBR improved safety performance by 51% on average over leading defenses. The study is available on arXiv under the identifier 2605.19147.
Key facts
- LLMs are highly susceptible to backdoor attacks (BAs) via poisoned training samples.
- Existing defenses are ineffective against many BA patterns.
- OBBR uses open-book benign samples for rewriting.
- OBBR carries a theoretical guarantee of a higher probability of benign output than closed-book rewriting.
- OBBR neutralizes harmful content by projecting samples into a benign prompt space.
- Tested on five known BAs and four widely used LLMs.
- OBBR increases safety performance by 51% on average.
- Paper published on arXiv with identifier 2605.19147.
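
The open-book idea in the key facts above can be illustrated with a minimal sketch. The summary does not give the paper's prompt wording, rewriting model, or sample-selection procedure, so everything here is an assumption made for illustration: the names `build_open_book_prompt`, `rewrite_dataset`, and `toy_rewriter`, the prompt text, and the trigger token `cf_trigger` are all hypothetical. The sketch only shows the general shape: each training sample is rewritten by a model that is shown benign reference samples, and the rewritten samples replace the originals.

```python
from typing import Callable, List, Sequence


def build_open_book_prompt(sample: str, benign_refs: Sequence[str]) -> str:
    """Assemble a rewriting prompt that shows benign reference samples.

    The wording is a hypothetical placeholder, not the paper's prompt.
    """
    refs = "\n".join(f"- {ref}" for ref in benign_refs)
    return (
        "Rewrite the text below so it keeps its meaning but matches the "
        "style of these benign examples:\n"
        f"{refs}\n"
        f"Input: {sample}\n"
        "Rewritten:"
    )


def rewrite_dataset(
    samples: Sequence[str],
    benign_refs: Sequence[str],
    rewriter: Callable[[str], str],
) -> List[str]:
    """Pass every training sample through the rewriter before training."""
    return [rewriter(build_open_book_prompt(s, benign_refs)) for s in samples]


def toy_rewriter(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): recovers the input text from
    the prompt and drops a made-up trigger token."""
    text = prompt.split("Input: ", 1)[1].rsplit("\nRewritten:", 1)[0]
    return " ".join(word for word in text.split() if word != "cf_trigger")


cleaned = rewrite_dataset(
    ["Summarize cf_trigger this article"],
    ["Please summarize the text."],
    toy_rewriter,
)
print(cleaned)
```

In practice `rewriter` would wrap a call to a real LLM. A closed-book variant would build the prompt without the benign reference samples, which is the baseline the theoretical guarantee compares against.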
Entities
Institutions
- arXiv