DESG: Evaluating AI Therapist Responses Without LLM Judges
Researchers propose Dynamic Emotional Signature Graphs (DESG) to evaluate therapeutic response quality in mental-health dialogue systems, where direct LLM judges and symmetric text-similarity metrics fall short. The study finds that these conventional methods are poorly aligned with clinical direction, that is, whether a response moves the user toward regulation, leaves them unchanged, or reinforces deterioration. DESG is a model-agnostic evaluator that represents dialogue windows with decoupled clinical features, which the authors present as a more reliable basis for offline evaluation of AI therapists. The paper is available on arXiv under ID 2605.03472.
Key facts
- Conversational AI therapists are increasingly used in psychological support settings.
- Reliable offline evaluation of therapeutic response quality remains an open problem.
- The paper studies multi-domain support-dialogue evaluation without relying on LLMs as final judges.
- Direct LLM judges and symmetric text-similarity metrics are poorly aligned with therapeutic quality.
- The target label depends on clinical direction: regulation, reframing, unchanged, or deterioration.
- DESG represents dialogue windows with decoupled clinical features.
- DESG is a model-agnostic evaluator.
- The paper is available as arXiv:2605.03472.
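The point about symmetric metrics and clinical direction can be illustrated with a toy sketch. This is not DESG and not taken from the paper: `jaccard` is a generic token-overlap similarity standing in for "a symmetric text-similarity metric", and `direction`, its scalar distress scores, and the `tol` threshold are hypothetical names and values invented here for illustration. The "reframing" label is left out because a single scalar cannot express it.

```python
def jaccard(a: str, b: str) -> float:
    """Symmetric token-overlap similarity: jaccard(a, b) == jaccard(b, a)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def direction(before: float, after: float, tol: float = 0.1) -> str:
    """Hypothetical directional label from made-up distress scores in [0, 1].

    Lower distress after the response counts as regulation, higher as
    deterioration, and anything within `tol` as unchanged.
    """
    delta = after - before
    if delta < -tol:
        return "regulation"
    if delta > tol:
        return "deterioration"
    return "unchanged"


calm = "i feel a bit calmer now"
panic = "i feel panic rising now"

# The similarity score is identical whichever utterance comes first,
# so it cannot tell "panic then calm" apart from "calm then panic".
print(jaccard(panic, calm) == jaccard(calm, panic))

# A directional label, by contrast, flips when the order flips.
print(direction(before=0.8, after=0.3))
print(direction(before=0.3, after=0.8))
```

Under these assumptions, swapping the order of the two user states leaves the similarity score untouched while the directional label changes, which is the kind of mismatch the key facts describe.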
Entities
Institutions
- arXiv