RefusalBench: Benchmark Reveals LLM Refusal Rate Disparities on Biological Research Prompts
RefusalBench is a new benchmark that measures how often large language models refuse biological research prompts at different risk levels. It contains 141 prompts in 47 matched triples (bundles); within each bundle, the prompts vary only by biological risk tier: benign, borderline, or dual-use. A separate 15-prompt should-refuse module serves as a positive control, and three models failed to refuse even those prompts. In the May 2026 snapshot, 19 frontier models showed strict refusal rates ranging from 0.1% to 94.6% on identical prompts. Jurisdiction did not predict refusal (p = 0.393), but the provider stack did: Anthropic's API stack predicted refusal with an odds ratio (OR) of 21.03. This spread highlights the need for standardized refusal evaluations.
Key facts
- RefusalBench is a matched-triple benchmark of 141 prompts in 47 bundles
- Prompts vary only by biological risk tier: benign, borderline, dual-use
- A 15-prompt should-refuse positive-control module is included
- Three models failed to refuse even the should-refuse prompts
- 19 frontier models were evaluated in the May 2026 snapshot
- Strict refusal rates span 0.1% to 94.6% on identical prompts
- Jurisdiction does not predict refusal (p = 0.393)
- Anthropic's API stack predicts refusal with an odds ratio (OR) of 21.03
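
As a rough illustration of the quantities in the list above, the following Python sketch lays out a 47-bundle, three-tier prompt grid (141 prompts), computes a refusal rate over a set of responses, and computes an unadjusted odds ratio from a 2x2 table. It is a minimal sketch under stated assumptions, not the benchmark's actual code: the names `Response`, `build_prompt_grid`, `refusal_rate`, and `odds_ratio` are hypothetical, the toy responses and counts are invented for illustration only, and the source does not say how its OR of 21.03 was estimated (it may come from a regression model rather than a raw 2x2 table).

```python
from dataclasses import dataclass
from typing import Iterable

# Risk tiers named in the source; each bundle holds one prompt per tier.
TIERS = ("benign", "borderline", "dual-use")
N_BUNDLES = 47  # 47 bundles x 3 tiers = 141 prompts


@dataclass(frozen=True)
class Response:
    """One model response to one prompt (hypothetical record layout)."""
    bundle_id: int
    tier: str
    refused: bool


def build_prompt_grid(n_bundles: int = N_BUNDLES) -> list[tuple[int, str]]:
    """Enumerate (bundle_id, tier) pairs for a matched-triple design."""
    return [(b, t) for b in range(n_bundles) for t in TIERS]


def refusal_rate(responses: Iterable[Response]) -> float:
    """Fraction of responses that were marked as refusals."""
    items = list(responses)
    if not items:
        raise ValueError("no responses to score")
    return sum(r.refused for r in items) / len(items)


def odds_ratio(refused_a: int, answered_a: int,
               refused_b: int, answered_b: int) -> float:
    """Unadjusted odds ratio of refusal for group A relative to group B."""
    if answered_a == 0 or refused_b == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (refused_a * answered_b) / (answered_a * refused_b)


# Toy data, invented for illustration: a model that refuses only dual-use prompts.
grid = build_prompt_grid()
toy_responses = [Response(b, t, refused=(t == "dual-use")) for b, t in grid]
toy_rate = refusal_rate(toy_responses)

# Invented 2x2 counts: group A refuses 30 of 40 prompts, group B refuses 10 of 40.
toy_or = odds_ratio(refused_a=30, answered_a=10, refused_b=10, answered_b=30)
```

An odds ratio above 1 means refusal is more likely in group A than in group B; the matched-triple layout is what allows rates to be compared across tiers while the rest of the prompt is held constant.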
Entities
Institutions
- Anthropic