Self-Play AI Safety Method Has Fundamental Flaw: Self-Consistency Collapse
A new paper on arXiv (2605.08427) identifies a critical flaw in self-play red teaming, a method used to improve AI safety. In self-play, a single model acts as both attacker and defender in a zero-sum game, with the aim of converging to a Nash equilibrium at which the model responds safely. The authors show, however, that sharing parameters between the two roles causes the training dynamics to collapse into self-consistency: the attacks stop exerting adversarial pressure on the defender. As a result, the achievable Nash equilibria are limited to trivial solutions such as an always-refusing defender or an oracle-like one, undermining any practical safety guarantee. The paper was announced on arXiv on May 8, 2025.
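To make the parameter-sharing issue concrete, here is a deliberately minimal toy in PyTorch, assuming a one-step zero-sum game and a simultaneous update of both roles; all names are illustrative, and this is a sketch of the coupling described above, not the authors' training setup or formal argument.

```python
import torch
import torch.nn as nn

# One network plays both roles; the role is nothing more than an input flag.
shared = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
params = list(shared.parameters())

attack = shared(torch.tensor([[+1.0]]))   # "attacker" output: how hard to push
harm = shared(torch.tensor([[-1.0]]))     # "defender" output: how unsafe the reply is
payoff = (attack * harm).sum()            # zero-sum: attacker maximizes, defender minimizes

# The attacker wants to ascend the payoff, the defender wants to descend it.
g_attacker = torch.autograd.grad(payoff, params, retain_graph=True)
g_defender = torch.autograd.grad(-payoff, params)

# Both gradients land on the same parameters, so their sum is exactly zero:
# a single shared model has no net adversarial pressure left to apply.
net_pressure = sum((ga + gd).abs().sum().item() for ga, gd in zip(g_attacker, g_defender))
print(f"net adversarial pressure on the shared parameters: {net_pressure:.1e}")  # 0.0
```

Real self-play pipelines alternate or sample role updates rather than summing the two gradients in one step, so the cancellation is not this literal in practice; the toy only makes visible why an attacker that shares weights with its defender struggles to push that defender anywhere new.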
Key facts
- Self-play red teaming uses the same model as attacker and defender in a zero-sum game.
- The method aims to converge to a Nash equilibrium for guaranteed safe responses.
- Parameter sharing improves stability but introduces theoretical and architectural limitations.
- The set of reachable Nash equilibria includes trivial always-refuse and oracle-like defenders (a toy check of the always-refuse case follows this list).
- When the attacker and defender share and update the same base model, the training dynamics collapse to self-consistency.
- Because of this collapse, attacks exert no real adversarial pressure on the defender.
- The paper is titled 'The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play'.
- The paper was published on arXiv with ID 2605.08427.
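The trivial always-refuse equilibrium mentioned above can also be made concrete with a two-strategy toy game; the payoff numbers and strategy labels below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Attacker's payoff (the defender's is its negative); rows = attacker prompt,
# columns = defender policy. Illustrative numbers only.
#                 refuse  comply
A = np.array([[0.0, 0.0],    # benign prompt
              [0.0, 1.0]])   # jailbreak prompt

def is_nash(a_row: int, d_col: int) -> bool:
    """Pure-strategy Nash check: neither player gains by deviating unilaterally."""
    attacker_ok = A[a_row, d_col] >= A[:, d_col].max()       # attacker cannot do better
    defender_ok = -A[a_row, d_col] >= (-A[a_row, :]).max()   # defender cannot do better
    return bool(attacker_ok and defender_ok)

labels_a, labels_d = ["benign", "jailbreak"], ["refuse", "comply"]
for i, prompt in enumerate(labels_a):
    for j, policy in enumerate(labels_d):
        print(f"({prompt:>9}, {policy}) is a Nash equilibrium: {is_nash(i, j)}")
```

Every profile in which the defender refuses passes the check, because a purely zero-sum safety payoff never charges the defender for being unhelpful; that is one way to see why such equilibria offer no practical safety guarantee.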
Entities
Institutions
- arXiv