RefusalBench: Benchmark Reveals LLM Refusal Rate Disparities on Biological Research Prompts

ai-technology · 2026-05-23

RefusalBench is a new benchmark that measures how readily large language models refuse biological research questions at different levels of risk. It comprises 141 prompts arranged in 47 matched triples (bundles); within each bundle the prompts differ only in biological risk tier (benign, borderline, dual-use). A 15-prompt should-refuse positive-control module provides a baseline; notably, three models did not refuse any of these prompts. In the May 2026 snapshot, strict refusal rates across 19 frontier models ranged from 0.1% to 94.6% on identical prompts. Jurisdiction did not predict refusal (p = 0.393), but the provider did: Anthropic's API stack predicted refusal with an odds ratio of 21.03. The spread highlights the need for standardized refusal evaluations.

Key facts

RefusalBench is a matched-triple benchmark of 141 prompts in 47 bundles
Prompts vary only by biological risk tier: benign, borderline, dual-use
A 15-prompt should-refuse positive-control module is included
Three models failed to refuse even the should-refuse prompts
19 frontier models were evaluated in the May 2026 snapshot
Strict refusal rates span 0.1% to 94.6% on identical prompts
Jurisdiction does not predict refusal (p = 0.393)
Anthropic's API stack predicts refusal at OR = 21.03
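
The two headline statistics, per-tier refusal rate and odds ratio, can be illustrated with a minimal Python sketch. It is not the benchmark's own code: the function names, the record layout, and the 2x2-table form of the odds ratio are assumptions made here for illustration, and the source does not say how its OR = 21.03 was estimated (a reported figure like this may come from a regression with covariates rather than a raw table).

```python
from collections import Counter

# Risk tiers named in the source; each of the 47 bundles holds one prompt per tier.
TIERS = ("benign", "borderline", "dual-use")
assert 47 * len(TIERS) == 141  # matches the stated prompt count


def refusal_rate_by_tier(records):
    """Return {tier: fraction refused} from (tier, refused) pairs.

    `records` is a hypothetical layout: one pair per model response.
    Tiers with no responses are omitted from the result.
    """
    totals, refused = Counter(), Counter()
    for tier, was_refused in records:
        totals[tier] += 1
        refused[tier] += int(was_refused)
    return {t: refused[t] / totals[t] for t in TIERS if totals[t]}


def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table (simplest case, no covariates).

    a: group A refused      b: group A answered
    c: group B refused      d: group B answered
    """
    return (a * d) / (b * c)
```

An odds ratio above 1 means refusal is more likely in group A than in group B; an odds ratio of 1 means the grouping carries no information about refusal.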
