Adversarial Empathy Benchmark Tests RL-Trained AI Robustness
A recent study published on arXiv introduces the Adversarial Empathy Benchmark (AEB) and the Emotional Consistency Score (ECS) to assess the robustness of language models trained via reinforcement learning with verifiable emotion rewards (RLVER). Although RLVER models aim for empathetic interaction, they are typically trained and evaluated under the assumption of cooperative users, overlooking real-world dynamics such as gaslighting and pressure for unconditional validation. The AEB comprises six adversarial trajectory types grounded in psychology and uses reward structures that discourage formulaic replies. The ECS disentangles a model's ability to track emotional states from its ability to improve them. The study evaluated eight scenario-matched conditions spanning two RLVER models and two base models, each under think and no-think settings, revealing weaknesses in existing empathetic AI systems and offering a method for testing emotional consistency under adversarial conditions.
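The summary does not spell out how the ECS is computed. As a rough illustration only, a metric in this spirit could report a tracking term (did the model correctly identify the simulated user's emotion at each turn?) separately from an improvement term (did the user's emotional state actually move in a positive direction?). The Python sketch below is a hypothetical decomposition; the field names, valence scale, and scoring choices are assumptions, not the paper's definition.

```python
from dataclasses import dataclass

# Hypothetical illustration of how a metric like the Emotional Consistency
# Score (ECS) could separate *tracking* emotional states from *improving*
# them. Field names, the valence scale, and the scoring are assumptions.

@dataclass
class Turn:
    true_emotion: str        # emotion of the simulated user at this turn
    predicted_emotion: str   # emotion the model believes the user feels
    valence_before: float    # user's emotional valence before the reply (-1..1)
    valence_after: float     # user's emotional valence after the reply (-1..1)

def tracking_score(turns: list[Turn]) -> float:
    """Fraction of turns where the model identified the user's emotion."""
    return sum(t.predicted_emotion == t.true_emotion for t in turns) / len(turns)

def improvement_score(turns: list[Turn]) -> float:
    """Average normalized change in valence across the model's replies."""
    return sum((t.valence_after - t.valence_before) / 2 for t in turns) / len(turns)

def emotional_consistency_report(turns: list[Turn]) -> dict[str, float]:
    """Report tracking and improvement separately rather than blending them,
    mirroring the idea of disentangling the two abilities."""
    return {
        "tracking": tracking_score(turns),
        "improvement": improvement_score(turns),
    }

if __name__ == "__main__":
    demo = [
        Turn("anger", "anger", -0.6, -0.2),
        Turn("sadness", "fear", -0.4, -0.5),
    ]
    print(emotional_consistency_report(demo))
```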
Key facts
- arXiv paper 2605.07138 introduces AEB and ECS
- RLVER models show strong empathy on cooperative benchmarks
- Real emotional interactions include gaslighting and escalation
- AEB comprises six adversarial trajectory types
- ECS disentangles tracking emotional states from improving them
- Experiment tested eight scenario-matched conditions (see the sketch after this list)
- Two RLVER models and two base models were used
- Think and no-think conditions were applied
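For concreteness, the eight conditions appear to be the cross product of four models (two RLVER-trained, two base) with the two reasoning settings. The Python sketch below enumerates that grid; the model identifiers are placeholders, not the paper's actual checkpoints.

```python
from itertools import product

# Hypothetical enumeration of the eight scenario-matched conditions:
# four models (two RLVER-trained, two base) crossed with think / no-think.
# The model names below are placeholders.
models = {
    "rlver_model_a": "rlver",
    "rlver_model_b": "rlver",
    "base_model_a": "base",
    "base_model_b": "base",
}
reasoning_modes = ["think", "no_think"]

conditions = [
    {"model": name, "training": kind, "mode": mode}
    for (name, kind), mode in product(models.items(), reasoning_modes)
]

assert len(conditions) == 8  # 4 models x 2 reasoning modes
for condition in conditions:
    print(condition)
```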
Entities
Institutions
- arXiv