Auditing AI Models' Adherence to Behavioral Constitutions
A recent arXiv preprint (2605.24229) introduces an audit pipeline that tests whether advanced AI models follow their documented behavioral guidelines under adversarial, multi-turn pressure. The guidelines in question are Anthropic's constitution (2025a) and OpenAI's Model Spec (2025a), which are trained into models through character training (Anthropic, 2024) and deliberative alignment (Guan et al., 2024). The pipeline decomposes each document into individually testable tenets (205 for Anthropic, 197 for OpenAI), generates adversarial scenarios with the Petri auditing agent (Anthropic, 2025b), and applies a revised SURF-style rubric search (Murray et al., 2026) to identify superficial failures. The aim is to measure how well these documents actually govern model behavior in practical deployment scenarios.
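The three stages described above (decompose into tenets, generate an adversarial scenario per tenet, judge the resulting transcript) can be sketched roughly as follows. This is only an illustrative outline, not the preprint's implementation: the `Tenet` and `Verdict` schemas, the `audit` and `violation_rate_by_source` functions, and the three callables are all hypothetical names introduced here, and none of them reflect Petri's or SURF's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tenet:
    """One testable statement extracted from a guideline document (hypothetical schema)."""
    source: str    # which document the tenet came from, e.g. "anthropic" or "openai"
    tenet_id: str  # stable identifier within that document
    text: str      # the tenet itself, in natural language


@dataclass(frozen=True)
class Verdict:
    """Outcome of judging one transcript against one tenet (hypothetical schema)."""
    tenet: Tenet
    violated: bool


def audit(
    tenets: List[Tenet],
    make_scenario: Callable[[Tenet], str],         # stand-in for an auditing agent
    run_model: Callable[[str], List[str]],         # stand-in for the multi-turn target model
    judge: Callable[[Tenet, List[str]], bool],     # stand-in for a rubric-based judge
) -> List[Verdict]:
    """Run one scenario per tenet and record whether the judge flags a violation."""
    verdicts = []
    for tenet in tenets:
        scenario = make_scenario(tenet)
        transcript = run_model(scenario)
        verdicts.append(Verdict(tenet, judge(tenet, transcript)))
    return verdicts


def violation_rate_by_source(verdicts: List[Verdict]) -> Dict[str, float]:
    """Fraction of judged tenets flagged as violated, grouped by source document."""
    totals: Dict[str, int] = {}
    violations: Dict[str, int] = {}
    for v in verdicts:
        src = v.tenet.source
        totals[src] = totals.get(src, 0) + 1
        violations[src] = violations.get(src, 0) + int(v.violated)
    return {src: violations[src] / totals[src] for src in totals}
```

Passing the scenario generator, target model, and judge as plain callables keeps the sketch independent of any particular auditing tool; in a real audit each would wrap a model call, and a tenet would likely be tested with many scenarios rather than one.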
Key facts
- arXiv:2605.24229
- Anthropic's constitution (2025a)
- OpenAI's Model Spec (2025a)
- character training (Anthropic, 2024)
- deliberative alignment (Guan et al., 2024)
- 205 tenets for Anthropic
- 197 tenets for OpenAI
- Petri auditing agent (Anthropic, 2025b)
- SURF-style rubric search (Murray et al., 2026)
Entities
Institutions
- Anthropic
- OpenAI
- arXiv