Adversarial Empathy Benchmark Tests RL-Trained AI Robustness
A recent study published on arXiv introduces the Adversarial Empathy Benchmark (AEB) and the Emotional Consistency Score (ECS) to assess the robustness of language models trained via reinforcement learning with verifiable emotion rewards (RLVER). Although RLVER models aim for empathetic interaction, they are typically trained and evaluated under the assumption of cooperative users, overlooking real-world dynamics such as gaslighting and pressure for unconditional validation. The AEB comprises six adversarial trajectory types grounded in psychology and uses reward structures that discourage formulaic replies. The ECS disentangles a model's ability to track emotional states from its ability to improve them. The study evaluated eight scenario-matched conditions spanning two RLVER models and two base models, each under think and no-think settings, revealing weaknesses in existing empathetic AI systems and offering a method for testing emotional consistency under adversarial conditions.
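The summary does not spell out how the ECS is computed. As a rough illustration only, a metric in this spirit could report a tracking term (did the model correctly identify the simulated user's emotion at each turn?) separately from an improvement term (did the user's emotional state actually move in a positive direction?). The Python sketch below is a hypothetical decomposition; the field names, valence scale, and scoring choices are assumptions, not the paper's definition.

```python
from dataclasses import dataclass

# Hypothetical illustration of how a metric like the Emotional Consistency
# Score (ECS) could separate *tracking* emotional states from *improving*
# them. Field names, the valence scale, and the scoring are assumptions.

@dataclass
class Turn:
    true_emotion: str        # emotion of the simulated user at this turn
    predicted_emotion: str   # emotion the model believes the user feels
    valence_before: float    # user's emotional valence before the reply (-1..1)
    valence_after: float     # user's emotional valence after the reply (-1..1)

def tracking_score(turns: list[Turn]) -> float:
    """Fraction of turns where the model identified the user's emotion."""
    return sum(t.predicted_emotion == t.true_emotion for t in turns) / len(turns)

def improvement_score(turns: list[Turn]) -> float:
    """Average normalized change in valence across the model's replies."""
    return sum((t.valence_after - t.valence_before) / 2 for t in turns) / len(turns)

def emotional_consistency_report(turns: list[Turn]) -> dict[str, float]:
    """Report tracking and improvement separately rather than blending them,
    mirroring the idea of disentangling the two abilities."""
    return {
        "tracking": tracking_score(turns),
        "improvement": improvement_score(turns),
    }

if __name__ == "__main__":
    demo = [
        Turn("anger", "anger", -0.6, -0.2),
        Turn("sadness", "fear", -0.4, -0.5),
    ]
    print(emotional_consistency_report(demo))
```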
Key facts
- arXiv paper 2605.07138 introduces AEB and ECS
- RLVER models show strong empathy on cooperative benchmarks
- Real emotional interactions include gaslighting and escalation
- AEB comprises six adversarial trajectory types
- ECS disentangles tracking emotional states from improving them
- Experiment tested eight scenario-matched conditions (see the sketch after this list)
- Two RLVER models and two base models were used
- Think and no-think conditions were applied
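For concreteness, the eight conditions appear to be the cross product of four models (two RLVER-trained, two base) with the two reasoning settings. The Python sketch below enumerates that grid; the model identifiers are placeholders, not the paper's actual checkpoints.

```python
from itertools import product

# Hypothetical enumeration of the eight scenario-matched conditions:
# four models (two RLVER-trained, two base) crossed with think / no-think.
# The model names below are placeholders.
models = {
    "rlver_model_a": "rlver",
    "rlver_model_b": "rlver",
    "base_model_a": "base",
    "base_model_b": "base",
}
reasoning_modes = ["think", "no_think"]

conditions = [
    {"model": name, "training": kind, "mode": mode}
    for (name, kind), mode in product(models.items(), reasoning_modes)
]

assert len(conditions) == 8  # 4 models x 2 reasoning modes
for condition in conditions:
    print(condition)
```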
Entities
Institutions
- arXiv