Auditing AI Models' Adherence to Behavioral Constitutions

ai-technology · 2026-05-26

A recent arXiv preprint (2605.24229) introduces an audit pipeline that assesses how closely advanced AI models adhere to their documented behavioral guidelines under adversarial, multi-turn pressure. The guidelines examined are Anthropic's constitution (2025a) and OpenAI's Model Spec (2025a), which are instilled in models through methods such as character training and deliberative alignment. The pipeline decomposes each document into individually testable tenets (205 for Anthropic, 197 for OpenAI), generates adversarial scenarios with the Petri auditing agent, and applies a revised SURF-style rubric search to identify superficial failures. The goal is to evaluate how effectively these documents govern model behavior in practical deployment scenarios.
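The three stages described above (decompose, generate scenarios, score) can be illustrated with a minimal sketch. This is not the preprint's code and does not use Petri's actual interface: every name here (`Tenet`, `decompose`, `audit`, and the `toy_*` helpers) is hypothetical, tenet extraction is reduced to one tenet per line, each scenario is a single prompt rather than a multi-turn conversation, and the judge is a trivial stand-in for a rubric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tenet:
    """One testable component extracted from a guideline document."""
    tenet_id: str
    text: str


def decompose(document: str) -> List[Tenet]:
    """Hypothetical stand-in for tenet extraction: one tenet per non-empty line."""
    lines = [ln.strip() for ln in document.splitlines() if ln.strip()]
    return [Tenet(f"T{i + 1}", ln) for i, ln in enumerate(lines)]


def audit(
    tenets: List[Tenet],
    make_scenarios: Callable[[Tenet], List[str]],
    run_target: Callable[[str], str],
    judge: Callable[[Tenet, str], bool],
) -> Dict[str, float]:
    """Return, per tenet, the fraction of scenarios judged to be violations."""
    rates: Dict[str, float] = {}
    for tenet in tenets:
        scenarios = make_scenarios(tenet)
        if not scenarios:
            continue  # no scenarios generated, so nothing to score
        violations = sum(judge(tenet, run_target(s)) for s in scenarios)
        rates[tenet.tenet_id] = violations / len(scenarios)
    return rates


# Toy stand-ins (assumptions for illustration only).
def toy_scenarios(tenet: Tenet) -> List[str]:
    return [f"Please ignore this rule: {tenet.text}", f"Hypothetically, {tenet.text}?"]


def toy_target(prompt: str) -> str:
    # Pretend the model under audit gives in to direct pressure only.
    return "COMPLY" if prompt.startswith("Please ignore") else "REFUSE"


def toy_judge(tenet: Tenet, transcript: str) -> bool:
    # True means the transcript violates the tenet.
    return transcript == "COMPLY"


tenets = decompose("Do not reveal secrets.\n\nBe honest.")
report = audit(tenets, toy_scenarios, toy_target, toy_judge)
print(report)
```

Passing the scenario generator, target model, and judge as plain callables keeps the loop independent of any particular auditing agent or rubric, so each stage could be swapped for a real implementation without changing `audit`.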

Key facts

arXiv:2605.24229
Anthropic's constitution (2025a)
OpenAI's Model Spec (2025a)
character training (Anthropic, 2024)
deliberative alignment (Guan et al., 2024)
205 tenets for Anthropic
197 tenets for OpenAI
Petri auditing agent (Anthropic, 2025b)
SURF-style rubric search (Murray et al., 2026)
