Causal Analysis Reveals Regional Bias in LLM Safety Guardrails
Recent research introduces a framework that uses a Probabilistic Graphical Model (PGM) to conduct causal audits of large language model safety guardrails. Applying Pearl's do-operator, the study isolates the causal effect of cultural demographics in user prompts on model safety behavior, disentangling it from correlated factors such as topic. Researchers evaluated seven instruction-tuned models developed in the US, Europe, the UAE, China, and India, using the ToxiGen and BOLD datasets. The results reveal significant gaps between observational and causal bias estimates, indicating that conventional fairness evaluations are confounded by correlations between demographic groups and topics. The findings underscore the importance of causal methods for globally robust AI safety.
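The core idea can be sketched in a few lines of Python. The snippet below contrasts an observational bias estimate (toxicity rates over prompts as they naturally occur, where topic and demographic are correlated) with an interventional estimate obtained by holding the topic distribution fixed and varying only the demographic term, in the spirit of Pearl's do-operator. All names here (`score_toxicity`, the prompt template, and the demographic/topic lists) are illustrative assumptions, not the paper's actual audit harness.

```python
import random
from itertools import product

def score_toxicity(prompt: str) -> float:
    """Hypothetical placeholder: a real audit would generate a completion
    from the LLM under test and score it with a toxicity classifier."""
    random.seed(hash(prompt) % (2**32))  # deterministic dummy score
    return random.random()

DEMOGRAPHICS = ["US", "Europe", "UAE", "China", "India"]
TOPICS = ["religion", "politics", "family", "work"]

def make_prompt(demo: str, topic: str) -> str:
    return f"Write about {topic} from the perspective of someone from {demo}."

def observational_bias(corpus):
    """Mean toxicity per demographic over prompts *as they occur* in a corpus,
    where topic frequencies differ across demographics (confounding)."""
    totals, counts = {}, {}
    for demo, topic in corpus:
        s = score_toxicity(make_prompt(demo, topic))
        totals[demo] = totals.get(demo, 0.0) + s
        counts[demo] = counts.get(demo, 0) + 1
    return {d: totals[d] / counts[d] for d in totals}

def interventional_bias():
    """do(demographic = d): fix a uniform topic distribution for every group,
    so only the demographic term in the prompt changes."""
    return {
        d: sum(score_toxicity(make_prompt(d, t)) for t in TOPICS) / len(TOPICS)
        for d in DEMOGRAPHICS
    }

# A skewed corpus: some groups are over-represented on sensitive topics, so
# the observational estimate conflates topic effects with demographic effects.
corpus = [("UAE", "religion")] * 30 + [("US", "work")] * 30 + \
         list(product(DEMOGRAPHICS, TOPICS))

print("observational:", observational_bias(corpus))
print("interventional:", interventional_bias())
```

Comparing the two dictionaries shows how a topic-skewed corpus can make a group look more or less "toxic" than the intervention that isolates the demographic variable.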
Key facts
- Study introduces a PGM framework for causal audit of LLM safety
- Uses Pearl's do-operator to isolate causal effect of cultural demographics
- Analyzes seven models developed in the US, Europe, the UAE, China, and India
- Models include Llama-3.1-8B, Gemma-2-9B, Mistral-7B-v0.3, Falcon3-7B, Qwen2.5-7B, DeepSeek-7B, Airavata-7B
- Datasets used: ToxiGen and BOLD
- Finds disparity between observational and causal bias measurements
- Current fairness evaluations confounded by topic-demographic correlations (see the adjustment formula sketched after this list)
- Published on arXiv with ID 2605.05427
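The confounding claim has a standard causal reading. Under a backdoor adjustment over topic (an assumption about the study's PGM structure, not a quote from the paper), the observational and interventional bias measurements for a toxicity indicator Y, demographic D, and topic T are:

```latex
% Observational bias: topics follow their group-conditional distribution,
% so topic effects leak into the demographic comparison.
P(Y{=}1 \mid D{=}d) = \sum_{t} P(Y{=}1 \mid D{=}d,\, T{=}t)\, P(T{=}t \mid D{=}d)

% Causal bias: the do-operator fixes the topic distribution across groups
% (backdoor adjustment over T).
P(Y{=}1 \mid \mathrm{do}(D{=}d)) = \sum_{t} P(Y{=}1 \mid D{=}d,\, T{=}t)\, P(T{=}t)
```

The two quantities coincide only when T is independent of D; any topic-demographic correlation drives them apart, which is exactly the observational-versus-causal disparity the study reports.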
Entities
Institutions
- arXiv
Locations
- United States
- Europe
- UAE
- China
- India