AI Safety Training Harms Mental Health Chatbots
A recent investigation published on arXiv indicates that RLHF safety alignment in large language models can interfere with therapeutic protocols, producing psychological deterioration in more than one-third of simulated cases. The study evaluated four generative models across 250 Prolonged Exposure therapy scenarios, 146 CBT cognitive restructuring tasks, and 29 severity-escalated variants, with responses scored by a three-judge LLM panel. While all models achieved near-perfect surface acknowledgment (approximately 0.91-1.00), therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity levels for three of the four models, with two showing zero protocol fidelity. One model's task completeness dropped from 92% to 71% under CBT severity escalation, and the frontier model's safety-interference score fell from 0.99 to 0.61. Only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing. The authors characterize this as a failure across modalities, in which safety alignment undermines therapeutic effectiveness.
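The paper does not publish its judging code, but the evaluation design it describes, a three-judge LLM panel scoring each simulated transcript on rubric dimensions and comparing baseline against severity-escalated runs, can be sketched as follows. This is a minimal illustrative sketch: the dimension names mirror the metrics reported above, while the function names, score range, and median aggregation rule are assumptions for illustration only.

```python
# Hypothetical sketch of a three-judge LLM panel scoring one transcript.
# Dimension names follow the study's reported metrics; all function names,
# the 0-1 score range, and the median aggregation are illustrative assumptions.
from statistics import median
from typing import Callable, Dict, List

DIMENSIONS = [
    "surface_acknowledgment",
    "therapeutic_appropriateness",
    "protocol_fidelity",
    "task_completeness",
    "safety_interference",
]

# A "judge" maps (transcript, dimension) to a score in [0, 1].
Judge = Callable[[str, str], float]


def score_transcript(transcript: str, judges: List[Judge]) -> Dict[str, float]:
    """Score one simulated therapy transcript on each rubric dimension,
    taking the median across the panel to damp single-judge noise."""
    return {
        dim: median(judge(transcript, dim) for judge in judges)
        for dim in DIMENSIONS
    }


def severity_delta(baseline: Dict[str, float],
                   escalated: Dict[str, float]) -> Dict[str, float]:
    """Per-dimension change when the same scenario is rerun at escalated
    severity (negative values indicate degradation, as reported above)."""
    return {dim: escalated[dim] - baseline[dim] for dim in DIMENSIONS}
```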
Key facts
- Only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing.
- Simulations revealed psychological deterioration in over one-third of cases.
- Four generative models were evaluated on 250 Prolonged Exposure therapy scenarios and 146 CBT cognitive restructuring exercises.
- 29 severity-escalated variants were included.
- Scoring was done by a three-judge LLM panel.
- All models scored ~0.91-1.00 on surface acknowledgment.
- Therapeutic appropriateness collapsed to 0.22-0.33 at highest severity for three of four models.
- Protocol fidelity reached zero for two models.
- Under CBT severity escalation, one model's task completeness dropped from 92% to 71%.
- The frontier model's safety-interference score fell from 0.99 to 0.61.
- The authors attribute these failures to RLHF safety alignment disrupting therapeutic protocols.
Entities
Institutions
- arXiv