CAREBench: New Benchmark Evaluates LLMs' Emotion Understanding via Appraisal Reasoning
Researchers have introduced CAREBench, a benchmark for assessing how well large language models (LLMs) understand emotion through cognitive appraisal reasoning. Grounded in appraisal theory, CAREBench provides complete inferential-chain annotations from both first- and third-person perspectives on real-world narratives, covering appraisal reasoning, appraisal ratings, and multi-label emotion tagging. The study introduces a process-level evaluation framework and runs systematic experiments on six LLMs, organized around four research questions. Results indicate that stronger models match or surpass human observers on some tasks but fall short on appraisal reasoning and on recognizing positive emotions. Performance also dissociates across steps of the chain and in sensitivity to appraisal interventions, suggesting that current models have not fully integrated appraisal reasoning.
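To make the idea of process-level evaluation concrete, here is a minimal Python sketch of how one annotated chain might be represented and scored step by step instead of with a single final score. Everything in it is an assumption made for illustration: the `ChainAnnotation` fields, the appraisal dimension names, the rating scale, and the two metrics (mean absolute error for ratings, Jaccard overlap for emotion labels) are hypothetical and are not taken from the CAREBench paper. Scoring the free-text reasoning step is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ChainAnnotation:
    """One hypothetical chain annotation; field names are illustrative only."""
    perspective: str                   # "first" or "third" person
    appraisal_reasoning: str           # free-text reasoning step (not scored here)
    appraisal_ratings: dict[str, float]  # appraisal dimension -> rating
    emotions: set[str]                 # multi-label emotion tags


def ratings_mae(pred: dict[str, float], gold: dict[str, float]) -> float:
    """Mean absolute error over the appraisal dimensions present in both."""
    shared = pred.keys() & gold.keys()
    if not shared:
        raise ValueError("no appraisal dimensions in common")
    return sum(abs(pred[k] - gold[k]) for k in shared) / len(shared)


def emotion_jaccard(pred: set[str], gold: set[str]) -> float:
    """Jaccard overlap between predicted and gold emotion label sets."""
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0


def score_steps(pred: ChainAnnotation, gold: ChainAnnotation) -> dict[str, float]:
    """Report each chain step separately so a weak step is not hidden by a strong one."""
    return {
        "ratings_mae": ratings_mae(pred.appraisal_ratings, gold.appraisal_ratings),
        "emotion_jaccard": emotion_jaccard(pred.emotions, gold.emotions),
    }


# Made-up example values, not benchmark data.
gold = ChainAnnotation(
    perspective="first",
    appraisal_reasoning="The outcome was wanted and felt partly under the narrator's control.",
    appraisal_ratings={"control": 2.0, "pleasantness": 4.0},
    emotions={"joy", "relief"},
)
pred = ChainAnnotation(
    perspective="first",
    appraisal_reasoning="The narrator got what they hoped for.",
    appraisal_ratings={"control": 3.0, "pleasantness": 4.0},
    emotions={"joy"},
)
scores = score_steps(pred, gold)
```

Keeping per-step scores separate is what would let the kind of dissociation reported in the study show up, for example a model that tags emotions well while its appraisal ratings drift from the annotations.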
Key facts
- CAREBench is the first benchmark with complete inferential chain annotations for emotion understanding.
- The benchmark is grounded in appraisal theory.
- Annotations cover first- and third-person perspectives on real-world narratives.
- The evaluation framework is process-level.
- Experiments were conducted across six LLMs.
- Stronger models match or surpass human observers on some tasks.
- Models fall short on appraisal reasoning and positive emotion recognition.
- Performance dissociations appear both across chain steps and in sensitivity to appraisal interventions.
Entities
Institutions
- arXiv