Defensibility Index and Probabilistic Signal for AI Content Moderation
In an April 2025 paper (2604.20972) published on arXiv, researchers introduce the Defensibility Index (DI) and Ambiguity Index (AI) for evaluating rule-following AI content moderation systems. They identify the "Agreement Trap," in which conventional agreement metrics penalize defensible decisions and misread genuine ambiguity as error. As a remedy, they propose policy-grounded correctness as the evaluation framework. The Probabilistic Defensibility Signal (PDS) derives a measure of reasoning stability from audit-model token logprobs, without requiring additional audit passes. The framework also treats LLM reasoning traces as a governance signal: the audit model verifies that a decision is logically derivable from the established rule hierarchy, rather than merely classifying content. The approach was validated on more than 193,000 reasoning traces.
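The paper does not spell out here how token logprobs are aggregated into the PDS, but the idea of a logprob-derived stability score can be sketched as follows. This is a hypothetical reading, assuming the signal is the geometric mean of per-token probabilities along the audit model's output; the function name and aggregation choice are illustrative, not the paper's definition.

```python
import math

def probabilistic_defensibility_signal(token_logprobs):
    """Aggregate audit-model token log-probabilities into a single
    stability score in (0, 1]. Higher values mean the audit model
    committed to its verdict with high confidence at every token.

    Assumption: PDS ~ geometric mean of per-token probabilities,
    i.e. exp(mean logprob). The paper's exact formula may differ."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)

# Confident audit trace: near-zero logprobs -> signal near 1.
confident = [-0.01, -0.02, -0.05]
# Unstable trace: the model wavered on several tokens.
unstable = [-0.01, -2.3, -1.6, -0.9]
assert probabilistic_defensibility_signal(confident) > \
       probabilistic_defensibility_signal(unstable)
```

Because the score is computed from logprobs the audit model already emits, no extra audit calls are needed, matching the paper's stated motivation.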
Key facts
- Paper arXiv:2604.20972 released in April 2025
- Introduces Defensibility Index (DI) and Ambiguity Index (AI)
- Identifies the Agreement Trap in content moderation evaluation
- Proposes policy-grounded correctness as evaluation framework
- Probabilistic Defensibility Signal (PDS) derived from audit-model token logprobs
- LLM reasoning traces used as governance signal
- Audit model verifies logical derivability from rule hierarchy
- Validated on 193,000+ reasoning traces
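The derivability check in the list above can be illustrated with a toy model. In the paper, the audit model reasons over natural-language policies and reasoning traces; here the rule hierarchy is reduced to explicit predicates with priorities, and the names (`Rule`, `is_derivable`) and the default-allow fallback are assumptions for the sketch, not the paper's formalism.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    priority: int                      # lower number = higher in the hierarchy
    applies: Callable[[dict], bool]    # predicate over content features
    verdict: str                       # "allow" or "remove"

def is_derivable(decision, content, rules):
    """Return True if the decision matches the verdict of the
    highest-priority rule that applies to the content.

    Toy stand-in for the audit model's check that a moderation
    decision logically follows from the rule hierarchy."""
    applicable = [r for r in rules if r.applies(content)]
    if not applicable:
        return decision == "allow"     # assumed default when no rule fires
    governing = min(applicable, key=lambda r: r.priority)
    return decision == governing.verdict

rules = [
    Rule(1, lambda c: c.get("threat", False), "remove"),
    Rule(2, lambda c: c.get("satire", False), "allow"),
]
# Satirical threat: the higher-priority threat rule governs.
assert is_derivable("remove", {"threat": True, "satire": True}, rules)
assert not is_derivable("allow", {"threat": True, "satire": True}, rules)
```

The point of the check, per the paper's framing, is that a decision is defensible when it follows from the hierarchy, even if human annotators disagreed with it.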
Entities
Institutions
- arXiv