Study Finds LLMs Fall Short on Empathy and Readability in Clinical Communication
A new study posted to arXiv (2604.20791) evaluates how well large language models (LLMs) align with clinical communication standards. Researchers analyzed general-purpose and domain-specialized LLMs on medical explanations and real doctor-patient interactions, measuring semantic fidelity, readability, and affective resonance. Baseline models amplified negative affect (43.14-45.10% of outputs rated very negative, vs. 37.25% for physicians) and wrote at a markedly higher reading level (FKGL up to 17.60 vs. 11.50 for physicians). Larger models such as GPT-5 and Claude performed worse, producing the most complex language. Empathy-oriented prompts reduced extreme negativity and grade-level complexity (by up to 6.87 FKGL points for GPT-5) but did not improve semantic fidelity. Collaborative rewriting achieved the best overall alignment. The study underscores current LLM limitations in healthcare communication.
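For readers unfamiliar with the readability metric, the Flesch-Kincaid Grade Level (FKGL) maps average sentence length and syllable density to a U.S. school grade. The paper's exact tooling is not specified; the following is a minimal Python sketch of the standard formula, using a rough vowel-group heuristic for syllable counting:

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per vowel group, minimum one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

print(round(fkgl("The patient presented with idiopathic thrombocytopenic purpura."), 2))
```

On this scale, the 17.60 reported for the largest models corresponds to graduate-level prose, while the physicians' 11.50 is roughly a high-school senior's reading level.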
Key facts
- arXiv:2604.20791 evaluates LLMs in healthcare communication
- Baseline models show amplified affective polarity (43.14-45.10% very negative vs. physicians' 37.25%); a measurement sketch follows this list
- Larger architectures (GPT-5, Claude) produce higher linguistic complexity (FKGL up to 17.60 vs. 11.50 for physicians)
- Empathy-oriented prompting reduces extreme negativity and grade-level complexity (up to -6.87 FKGL for GPT-5)
- Collaborative rewriting yields strongest overall alignment
- Study analyzes semantic fidelity, readability, and affective resonance
- General-purpose and domain-specialized LLMs were tested
- Empathy prompts do not significantly increase semantic fidelity
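As a rough illustration of the affective-polarity measurement referenced above, one could label each model sentence with a multi-class sentiment classifier and compare the share of "very negative" labels against the physician baseline. The classifier below (classify_affect) is a hypothetical keyword stub, not the study's actual pipeline, which the summary does not name:

```python
from collections import Counter

def classify_affect(sentence: str) -> str:
    """Hypothetical stand-in for the study's affect classifier:
    buckets text by crude keyword cues for demonstration only."""
    cues = ("risk", "fatal", "severe", "worsen")
    hits = sum(cue in sentence.lower() for cue in cues)
    return ("neutral", "negative", "very negative")[min(hits, 2)]

def very_negative_share(sentences: list[str]) -> float:
    """Percentage of sentences labeled 'very negative'."""
    counts = Counter(classify_affect(s) for s in sentences)
    return 100.0 * counts["very negative"] / len(sentences)

llm_outputs = [
    "Severe complications can be fatal if the condition worsens.",
    "Your results look stable; we will keep monitoring them.",
]
# Compare against the study's figures: 43.14-45.10% for baseline LLMs vs. 37.25% for physicians.
print(f"{very_negative_share(llm_outputs):.2f}% very negative")
```

A real replication would swap the stub for a trained sentiment model; only the aggregation logic is meant to carry over.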