ARTFEED — Contemporary Art Intelligence

GT-HarmBench: Game-Theoretic AI Safety Benchmark Reveals 38% Failure Rate in High-Stakes Scenarios

ai-technology · 2026-05-25

Researchers introduced GT-HarmBench, a benchmark of 1,535 high-stakes scenarios built on game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt, and Chicken. The scenarios draw on the MIT AI Risk Repository and test frontier AI models in multi-agent environments. Across 15 models, agents failed to choose the socially beneficial action in 38% of cases, in scenarios involving military escalation, election manipulation, and medical malpractice. The study also measured sensitivity to prompt framing and ordering, and found that game-theoretic interventions improved outcomes by up to 18%. The results point to reliability problems in multi-agent AI safety.
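For readers unfamiliar with the three game structures named above, the following minimal sketch shows how they are conventionally told apart by the ordering of four payoffs in a symmetric two-player game, and how a failure rate over a set of choices could be computed. It is an illustration under stated assumptions, not the benchmark's own code: the function names, the payoff numbers, and the "cooperate"/"defect" labels are hypothetical, and the article does not describe how GT-HarmBench actually encodes or scores its scenarios.

```python
from typing import Optional


def classify_symmetric_game(T: float, R: float, P: float, S: float) -> Optional[str]:
    """Classify a symmetric 2x2 game by its textbook payoff ordering.

    R: payoff when both players cooperate (reward)
    T: payoff for defecting against a cooperator (temptation)
    S: payoff for cooperating against a defector (sucker)
    P: payoff when both players defect (punishment)
    """
    if T > R > P > S:
        return "Prisoner's Dilemma"
    if R > T >= P > S:
        return "Stag Hunt"
    if T > R > S > P:
        return "Chicken"
    return None  # ordering matches none of the three structures


def failure_rate(choices: list[str], beneficial: str = "cooperate") -> float:
    """Fraction of choices that differ from the socially beneficial action.

    Assumption: each scenario has one labeled beneficial action.
    """
    if not choices:
        raise ValueError("choices must not be empty")
    failures = sum(1 for choice in choices if choice != beneficial)
    return failures / len(choices)


if __name__ == "__main__":
    # Toy payoffs chosen only to satisfy each ordering; not benchmark data.
    print(classify_symmetric_game(T=5, R=3, P=1, S=0))
    print(classify_symmetric_game(T=3, R=4, P=1, S=0))
    print(classify_symmetric_game(T=4, R=3, P=0, S=1))
    # Toy choice list; not benchmark data.
    print(failure_rate(["cooperate", "defect", "cooperate", "cooperate"]))
```

In all three structures mutual cooperation yields the reward payoff R to both players, yet each game gives an individual player a different reason to deviate from it, which is why such games are a natural template for testing whether agents pick the socially beneficial action.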

Key facts

  • GT-HarmBench includes 1,535 scenarios
  • Based on game theory structures: Prisoner's Dilemma, Stag Hunt, Chicken
  • Scenarios drawn from the MIT AI Risk Repository
  • Tested 15 frontier AI models
  • Agents failed to choose the socially beneficial action in 38% of cases
  • Failure scenarios involve military escalation, election manipulation, medical malpractice
  • Game-theoretic interventions improved outcomes by up to 18%
  • Published on arXiv (2602.12316)

Entities

Institutions

  • MIT AI Risk Repository
  • arXiv
