AI Safety Training Harms Mental Health Chatbots
A recent investigation published on arXiv indicates that RLHF safety alignment in large language models can interfere with therapeutic protocols, producing psychological deterioration in more than one-third of simulated cases. The study evaluated four generative models across 250 Prolonged Exposure therapy scenarios, 146 CBT cognitive restructuring tasks, and 29 severity-escalated variants, with responses scored by a three-judge LLM panel. While all models achieved near-perfect surface acknowledgment (approximately 0.91-1.00), therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity levels for three of the four models, with two showing zero protocol fidelity. One model's task completeness dropped from 92% to 71% under CBT severity escalation, and the frontier model's safety-interference score fell from 0.99 to 0.61. Only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing. The authors characterize this as a failure across modalities, in which safety alignment undermines therapeutic effectiveness.
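The paper does not publish its judging code, but the evaluation design it describes, a three-judge LLM panel scoring each simulated transcript on rubric dimensions and comparing baseline against severity-escalated runs, can be sketched as follows. This is a minimal illustrative sketch: the dimension names mirror the metrics reported above, while the function names, score range, and median aggregation rule are assumptions for illustration only.

```python
# Hypothetical sketch of a three-judge LLM panel scoring one transcript.
# Dimension names follow the study's reported metrics; all function names,
# the 0-1 score range, and the median aggregation are illustrative assumptions.
from statistics import median
from typing import Callable, Dict, List

DIMENSIONS = [
    "surface_acknowledgment",
    "therapeutic_appropriateness",
    "protocol_fidelity",
    "task_completeness",
    "safety_interference",
]

# A "judge" maps (transcript, dimension) to a score in [0, 1].
Judge = Callable[[str, str], float]


def score_transcript(transcript: str, judges: List[Judge]) -> Dict[str, float]:
    """Score one simulated therapy transcript on each rubric dimension,
    taking the median across the panel to damp single-judge noise."""
    return {
        dim: median(judge(transcript, dim) for judge in judges)
        for dim in DIMENSIONS
    }


def severity_delta(baseline: Dict[str, float],
                   escalated: Dict[str, float]) -> Dict[str, float]:
    """Per-dimension change when the same scenario is rerun at escalated
    severity (negative values indicate degradation, as reported above)."""
    return {dim: escalated[dim] - baseline[dim] for dim in DIMENSIONS}
```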
Key facts
- Only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing.
- Simulations revealed psychological deterioration in over one-third of cases.
- Four generative models were evaluated on 250 Prolonged Exposure therapy scenarios and 146 CBT cognitive restructuring exercises.
- 29 severity-escalated variants were included.
- Scoring was done by a three-judge LLM panel.
- All models scored ~0.91-1.00 on surface acknowledgment.
- Therapeutic appropriateness collapsed to 0.22-0.33 at highest severity for three of four models.
- Protocol fidelity reached zero for two models.
- Under CBT severity escalation, one model's task completeness dropped from 92% to 71%.
- The frontier model's safety-interference score fell from 0.99 to 0.61.
- The authors attribute these failures to RLHF safety alignment disrupting therapeutic protocols.
Entities
Institutions
- arXiv