Study Finds LLMs Fall Short on Empathy and Readability in Clinical Communication
A new study posted to arXiv (2604.20791) evaluates how well large language models (LLMs) align with clinical communication standards. Researchers analyzed general-purpose and domain-specialized LLMs on medical explanations and real doctor-patient interactions, measuring semantic fidelity, readability, and affective resonance. Baseline models amplified negative affect (43.14-45.10% of outputs rated very negative, vs. 37.25% for physicians) and wrote at a markedly higher reading level (FKGL up to 17.60 vs. 11.50 for physicians). Larger models such as GPT-5 and Claude performed worse, producing the most complex language. Empathy-oriented prompts reduced extreme negativity and grade-level complexity (by up to 6.87 FKGL points for GPT-5) but did not improve semantic fidelity. Collaborative rewriting achieved the best overall alignment. The study underscores current LLM limitations in healthcare communication.
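For readers unfamiliar with the readability metric, the Flesch-Kincaid Grade Level (FKGL) maps average sentence length and syllable density to a U.S. school grade. The paper's exact tooling is not specified; the following is a minimal Python sketch of the standard formula, using a rough vowel-group heuristic for syllable counting:

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per vowel group, minimum one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

print(round(fkgl("The patient presented with idiopathic thrombocytopenic purpura."), 2))
```

On this scale, the 17.60 reported for the largest models corresponds to graduate-level prose, while the physicians' 11.50 is roughly a high-school senior's reading level.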
Key facts
- arXiv:2604.20791 evaluates LLMs in healthcare communication
- Baseline models show amplified affective polarity (43.14-45.10% very negative vs. physicians' 37.25%); a measurement sketch follows this list
- Larger architectures (GPT-5, Claude) produce higher linguistic complexity (FKGL up to 17.60 vs. 11.50 for physicians)
- Empathy-oriented prompting reduces extreme negativity and grade-level complexity (up to -6.87 FKGL for GPT-5)
- Collaborative rewriting yields strongest overall alignment
- Study analyzes semantic fidelity, readability, and affective resonance
- General-purpose and domain-specialized LLMs were tested
- Empathy prompts do not significantly increase semantic fidelity
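As a rough illustration of the affective-polarity measurement referenced above, one could label each model sentence with a multi-class sentiment classifier and compare the share of "very negative" labels against the physician baseline. The classifier below (classify_affect) is a hypothetical keyword stub, not the study's actual pipeline, which the summary does not name:

```python
from collections import Counter

def classify_affect(sentence: str) -> str:
    """Hypothetical stand-in for the study's affect classifier:
    buckets text by crude keyword cues for demonstration only."""
    cues = ("risk", "fatal", "severe", "worsen")
    hits = sum(cue in sentence.lower() for cue in cues)
    return ("neutral", "negative", "very negative")[min(hits, 2)]

def very_negative_share(sentences: list[str]) -> float:
    """Percentage of sentences labeled 'very negative'."""
    counts = Counter(classify_affect(s) for s in sentences)
    return 100.0 * counts["very negative"] / len(sentences)

llm_outputs = [
    "Severe complications can be fatal if the condition worsens.",
    "Your results look stable; we will keep monitoring them.",
]
# Compare against the study's figures: 43.14-45.10% for baseline LLMs vs. 37.25% for physicians.
print(f"{very_negative_share(llm_outputs):.2f}% very negative")
```

A real replication would swap the stub for a trained sentiment model; only the aggregation logic is meant to carry over.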