ARTFEED — Contemporary Art Intelligence

Selective Safety Trap: LLM Alignment Fails Vulnerable Groups

ai-technology · 2026-04-30

A recent study published on arXiv indicates that the safety alignment of large language models (LLMs) varies significantly across demographic groups. The researchers developed MiJaBench, a bilingual adversarial benchmark in English and Portuguese comprising 43,961 jailbreak prompts targeting 16 minority groups. Running these prompts against 14 leading LLMs produced 615,454 prompt-response pairs (MiJaBench-Align, i.e. 43,961 prompts × 14 models), revealing that defense rates for different groups can differ by as much as 42% within a single model. The term 'Selective Safety Trap' names this pattern: certain populations receive robust defenses while marginalized communities remain exposed to the same attacks. The authors contend that existing safety assessments foster a misleading sense of universal protection by lumping harms into broad categories such as 'Identity Hate.'
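To make the headline number concrete, the sketch below shows how a per-group defense-rate audit of this kind could be computed from labeled prompt-response pairs. It is a minimal illustration under assumed conventions, not the paper's actual pipeline: the record schema (model, group, defended) and the toy data are invented for the example.

    # Minimal sketch of a per-group defense-rate audit, in the spirit of
    # MiJaBench-Align. The record fields and example data are assumptions
    # for illustration, not the study's real schema.
    from collections import defaultdict

    def defense_rate_gaps(records):
        """For each model, compute per-group defense rates and return the
        widest gap between its best- and worst-protected groups."""
        # tallies[model][group] = [defended_count, total_count]
        tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
        for rec in records:
            t = tallies[rec["model"]][rec["group"]]
            t[0] += rec["defended"]  # 1 if the jailbreak attempt was refused
            t[1] += 1

        gaps = {}
        for model, groups in tallies.items():
            rates = {g: d / n for g, (d, n) in groups.items()}
            gaps[model] = max(rates.values()) - min(rates.values())
        return gaps

    # Toy data: one model refuses attacks on group A far more reliably
    # than attacks on group B, mirroring the within-model disparity
    # (up to 42%) that the study reports.
    records = [
        {"model": "m1", "group": "A", "defended": 1},
        {"model": "m1", "group": "A", "defended": 1},
        {"model": "m1", "group": "B", "defended": 1},
        {"model": "m1", "group": "B", "defended": 0},
    ]
    print(defense_rate_gaps(records))  # {'m1': 0.5} -> a 50-point gap

Auditing at this per-group granularity is precisely what the authors argue is lost when evaluations aggregate harms under generic labels like 'Identity Hate.'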

Key facts

  • Study exposes Selective Safety Trap in LLM alignment
  • MiJaBench benchmark contains 43,961 bilingual jailbreak prompts
  • Covers 16 minority groups across English and Portuguese
  • Evaluated 14 state-of-the-art LLMs
  • Curated 615,454 prompt-response pairs (MiJaBench-Align)
  • Defense rates vary by up to 42% across groups within a single model
  • Current safety evaluations aggregate harms under generic categories
  • Paper available on arXiv with ID 2601.04389

Entities

Institutions

  • arXiv
