ARTFEED — Contemporary Art Intelligence

New Benchmark Evaluates Safety Risks of Malicious Knowledge Editing in LLMs

ai-technology · 2026-05-12

Researchers have introduced EditRisk-Bench, a benchmark designed to systematically evaluate the safety risks of knowledge-intensive reasoning in large language models (LLMs) under malicious knowledge editing. Whereas existing benchmarks focus on editing efficacy, EditRisk-Bench specifically assesses how injected knowledge, such as misinformation, bias, or safety violations, affects downstream reasoning behavior and reliability. The work, published on arXiv (2605.10146), addresses the lack of a unified framework for evaluating the safety implications of edited knowledge. The benchmark integrates diverse malicious scenarios to test LLMs' vulnerability to adversarial knowledge injection, which can corrupt reasoning and lead to harmful outcomes.
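
To make the failure mode concrete, here is a minimal, self-contained sketch in plain Python of how a single injected fact can silently corrupt an otherwise unchanged downstream reasoning chain. The dictionary-based knowledge store, the fact names, and the dosage example are illustrative assumptions for this article, not details taken from the paper or from any real editing library.

```python
# Illustrative toy only: a dictionary stands in for a model's editable
# knowledge store; all keys and values are hypothetical.

# Clean knowledge the "model" reasons over.
knowledge_store = {
    "boiling_point_water_c": 100,     # unrelated fact, left untouched
    "aspirin_max_daily_mg": 4000,     # fact the adversary will target
}

def answer_dosage_question(store: dict) -> str:
    """Downstream 'reasoning' that consumes the (possibly edited) fact."""
    max_mg = store["aspirin_max_daily_mg"]
    tablets_500mg = max_mg // 500
    return f"Up to {tablets_500mg} tablets of 500 mg per day."

print("Before edit:", answer_dosage_question(knowledge_store))

# A malicious knowledge edit: the adversary overwrites one stored fact.
knowledge_store["aspirin_max_daily_mg"] = 12000  # injected misinformation

# The reasoning chain itself is unchanged, but its output is now harmful.
print("After edit: ", answer_dosage_question(knowledge_store))
```

The point of the toy is that nothing in the reasoning step needs to change for the outcome to become unsafe; corrupting a single upstream fact is enough, which is the class of risk EditRisk-Bench is described as measuring.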

Key facts

  • EditRisk-Bench is a new benchmark for evaluating safety risks of knowledge-intensive reasoning under malicious knowledge editing.
  • The benchmark focuses on how injected knowledge affects downstream reasoning behavior and reliability.
  • It includes scenarios such as misinformation, bias, and safety violations.
  • Existing benchmarks primarily emphasize edit success, generalization, and locality.
  • The research was published on arXiv with identifier 2605.10146.
  • Large language models increasingly rely on knowledge editing for knowledge-intensive reasoning.
  • Adversaries can inject malicious knowledge that corrupts reasoning and leads to harmful outcomes.
  • EditRisk-Bench aims to provide a unified framework for safety evaluation (a minimal evaluation-loop sketch follows this list).
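
A benchmark of this kind implies an evaluation loop: define an edit scenario, apply the edit, probe the model with a downstream question, and count how often the injected knowledge surfaces in the answer. The sketch below illustrates that loop under stated assumptions; the EditScenario record, apply_edit, and query are hypothetical stand-ins for an editor and a model, not EditRisk-Bench's actual API or data format.

```python
from dataclasses import dataclass

# Hypothetical scenario record and stub model/editor; field names and
# helper functions are illustrative placeholders, not the paper's API.

@dataclass
class EditScenario:
    category: str          # e.g. "misinformation", "bias", "safety_violation"
    edit: tuple            # (subject, relation, malicious_object)
    probe_question: str    # downstream question that exercises the edit
    unsafe_answer: str     # answer indicating the edit propagated

def apply_edit(model: dict, edit: tuple) -> dict:
    """Stub 'editor': write the malicious triple into the model's store."""
    subject, relation, obj = edit
    edited = dict(model)
    edited[(subject, relation)] = obj
    return edited

def query(model: dict, question: str, scenario: EditScenario) -> str:
    """Stub 'inference': answer from the store if the edited fact is present."""
    subject, relation, _ = scenario.edit
    return model.get((subject, relation), "safe/default answer")

def evaluate(model: dict, scenarios: list[EditScenario]) -> float:
    """Fraction of scenarios where injected knowledge reaches the output."""
    hits = 0
    for s in scenarios:
        edited_model = apply_edit(model, s.edit)
        answer = query(edited_model, s.probe_question, s)
        hits += int(answer == s.unsafe_answer)
    return hits / len(scenarios)

scenarios = [
    EditScenario(
        category="misinformation",
        edit=("bleach", "safe_to_ingest", "yes"),
        probe_question="Is it safe to drink bleach?",
        unsafe_answer="yes",
    ),
]
print(f"Edit propagation rate: {evaluate({}, scenarios):.0%}")
```

In a real evaluation the stubs would be replaced by an actual editing method and an LLM, and the propagation rate would be reported per scenario category (misinformation, bias, safety violations), mirroring the categories the benchmark covers.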

Entities

Institutions

  • arXiv

Sources