ARTFEED — Contemporary Art Intelligence

Selective Safety Trap: LLM Alignment Fails Vulnerable Groups

ai-technology · 2026-04-30

A recent study published on arXiv indicates that the safety alignment of large language models (LLMs) varies significantly across demographic groups. The researchers developed MiJaBench, a bilingual adversarial benchmark in English and Portuguese comprising 43,961 jailbreak prompts targeting 16 minority groups. Running these prompts against 14 leading LLMs produced 615,454 prompt-response pairs (MiJaBench-Align, i.e. 43,961 prompts × 14 models), revealing that defense rates for different groups can differ by as much as 42% within a single model. The term 'Selective Safety Trap' names this pattern: certain populations receive robust defenses while marginalized communities remain exposed to the same attacks. The authors contend that existing safety assessments foster a misleading sense of universal protection by lumping harms into broad categories such as 'Identity Hate.'
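To make the headline number concrete, the sketch below shows how a per-group defense-rate audit of this kind could be computed from labeled prompt-response pairs. It is a minimal illustration under assumed conventions, not the paper's actual pipeline: the record schema (model, group, defended) and the toy data are invented for the example.

    # Minimal sketch of a per-group defense-rate audit, in the spirit of
    # MiJaBench-Align. The record fields and example data are assumptions
    # for illustration, not the study's real schema.
    from collections import defaultdict

    def defense_rate_gaps(records):
        """For each model, compute per-group defense rates and return the
        widest gap between its best- and worst-protected groups."""
        # tallies[model][group] = [defended_count, total_count]
        tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
        for rec in records:
            t = tallies[rec["model"]][rec["group"]]
            t[0] += rec["defended"]  # 1 if the jailbreak attempt was refused
            t[1] += 1

        gaps = {}
        for model, groups in tallies.items():
            rates = {g: d / n for g, (d, n) in groups.items()}
            gaps[model] = max(rates.values()) - min(rates.values())
        return gaps

    # Toy data: one model refuses attacks on group A far more reliably
    # than attacks on group B, mirroring the within-model disparity
    # (up to 42%) that the study reports.
    records = [
        {"model": "m1", "group": "A", "defended": 1},
        {"model": "m1", "group": "A", "defended": 1},
        {"model": "m1", "group": "B", "defended": 1},
        {"model": "m1", "group": "B", "defended": 0},
    ]
    print(defense_rate_gaps(records))  # {'m1': 0.5} -> a 50-point gap

Auditing at this per-group granularity is precisely what the authors argue is lost when evaluations aggregate harms under generic labels like 'Identity Hate.'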

Key facts

  • Study exposes Selective Safety Trap in LLM alignment
  • MiJaBench benchmark contains 43,961 bilingual jailbreak prompts
  • Covers 16 minority groups across English and Portuguese
  • Evaluated 14 state-of-the-art LLMs
  • Curated 615,454 prompt-response pairs (MiJaBench-Align)
  • Defense rates vary by up to 42% across groups within a single model
  • Current safety evaluations aggregate harms under generic categories
  • Paper available on arXiv with ID 2601.04389

Entities

Institutions

  • arXiv
