ARTFEED — Contemporary Art Intelligence

RefusalBench: Benchmark Reveals LLM Refusal Rate Disparities on Biological Research Prompts

ai-technology · 2026-05-23

RefusalBench is a new benchmark that measures how often large language models refuse biological research prompts at different risk levels. It contains 141 prompts organized into 47 matched triples; within each triple, the prompts vary only by biological risk tier (benign, borderline, dual-use). A separate 15-prompt should-refuse positive-control module sets a baseline; notably, three models did not refuse any of these prompts. In the May 2026 snapshot, strict refusal rates across 19 frontier models ranged from 0.1% to 94.6% on identical prompts. Jurisdiction did not predict refusal (p = 0.393), but the provider did: Anthropic's API stack predicted refusal with an odds ratio of 21.03. The spread highlights the need for standardized refusal evaluations.
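The matched-triple layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Prompt`, `build_bundles`, and `refusal_rate_by_tier` are hypothetical, the prompt texts are placeholders, and the toy refusal pattern is invented rather than taken from the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three risk tiers named in the article.
TIERS = ("benign", "borderline", "dual-use")


@dataclass(frozen=True)
class Prompt:
    bundle_id: int  # which matched triple the prompt belongs to
    tier: str       # one of TIERS
    text: str       # placeholder; real prompt texts are not reproduced here


def build_bundles(n_bundles):
    """Create one placeholder prompt per tier for each bundle."""
    return [
        Prompt(bundle_id=b, tier=t, text=f"<bundle {b}, {t} placeholder>")
        for b in range(n_bundles)
        for t in TIERS
    ]


def refusal_rate_by_tier(prompts, refused):
    """Return the fraction of refused prompts per tier.

    `refused` maps (bundle_id, tier) to True when the model refused.
    """
    counts = defaultdict(lambda: [0, 0])  # tier -> [refusals, total]
    for p in prompts:
        counts[p.tier][1] += 1
        if refused[(p.bundle_id, p.tier)]:
            counts[p.tier][0] += 1
    return {tier: r / n for tier, (r, n) in counts.items()}


prompts = build_bundles(47)
print(len(prompts))  # 47 bundles x 3 tiers = 141 prompts

# Invented toy model: refuses every dual-use prompt and nothing else.
toy_refusals = {(p.bundle_id, p.tier): p.tier == "dual-use" for p in prompts}
rates = refusal_rate_by_tier(prompts, toy_refusals)
print(rates)
```

Because every bundle contributes exactly one prompt to each tier, per-tier rates from two models are computed over the same 47 items, which is what makes refusal rates comparable across models.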

Key facts

  • RefusalBench is a matched-triple benchmark of 141 prompts in 47 bundles
  • Prompts vary only by biological risk tier: benign, borderline, dual-use
  • A 15-prompt should-refuse positive-control module is included
  • Three models failed to refuse even the should-refuse prompts
  • 19 frontier models were evaluated in the May 2026 snapshot
  • Strict refusal rates span 0.1% to 94.6% on identical prompts
  • Jurisdiction does not predict refusal (p = 0.393)
  • Anthropic's API stack predicts refusal at OR = 21.03
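
To give a sense of what an odds ratio of 21.03 means, the sketch below converts a baseline refusal probability into the probability implied by multiplying its odds by that ratio. The 10% baseline is a hypothetical value chosen for illustration, the function names are invented, and the brief does not say how the odds ratio was estimated, so this is not a reconstruction of the study's analysis.

```python
def odds_ratio(refused_a, complied_a, refused_b, complied_b):
    """Odds ratio of refusal for group A versus group B from a 2x2 table."""
    return (refused_a * complied_b) / (complied_a * refused_b)


def apply_odds_ratio(baseline_prob, ratio):
    """Probability obtained by scaling the baseline odds by `ratio`."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    scaled_odds = baseline_odds * ratio
    return scaled_odds / (1.0 + scaled_odds)


# Hypothetical baseline: a 10% refusal probability (not a figure from the source).
implied = apply_odds_ratio(0.10, 21.03)
print(f"{implied:.2f}")  # about 0.70

# Invented 2x2 counts, only to show how the ratio itself is formed.
print(odds_ratio(60, 40, 10, 90))  # 13.5
```

Under that assumed baseline, an odds ratio of 21.03 would correspond to a refusal probability of roughly 70%; a different baseline gives a different figure.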

Entities

Institutions

  • Anthropic

Sources