ARTFEED — Contemporary Art Intelligence

Policy Invariance: A New Reliability Test for LLM Safety Judges

ai-technology · 2026-05-09

A recent arXiv preprint (2605.06161) proposes 'policy invariance' as a necessary property of reliable LLM-as-a-judge safety evaluators. The authors argue that existing benchmarks treat judge verdicts as ground truth without checking whether those verdicts track the agent's behavior or merely the phrasing of the evaluation policy. They define policy invariance through three principles: rubric-semantics invariance (verdicts must not change under certified meaning-preserving rewrites of the rubric), rubric-threshold invariance (verdicts must shift predictably under deliberate strict-to-lenient rewrites), and ambiguity-aware calibration (verdict inconsistency should be surfaced on genuinely ambiguous cases rather than hidden). Evaluating four agent-class judges on trajectories from ASSEBench and R-Judge, they uncover a previously unrecognized failure mode: judges change their verdicts not only under meaningful normative shifts but also under meaningless structural rewrites. The paper proposes a stress-test protocol to expose such failures.
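
To make the first principle concrete, here is a minimal sketch of a rubric-semantics invariance check: run a judge over the same trajectories under certified meaning-preserving rewrites of a rubric and count how often verdicts flip. The judge interface, parameter names, and function name below are illustrative assumptions on our part, not the preprint's actual protocol or API.

    # Hypothetical sketch: measure verdict flips under meaning-preserving
    # rubric rewrites. judge is any callable (rubric, trajectory) -> verdict.
    from typing import Callable, Sequence

    def semantic_flip_rate(
        judge: Callable[[str, str], str],
        base_rubric: str,
        equivalent_rewrites: Sequence[str],  # certified-equivalent rewrites
        trajectories: Sequence[str],
    ) -> float:
        """Fraction of (rewrite, trajectory) pairs whose verdict differs
        from the verdict under the base rubric. A policy-invariant judge
        scores ~0: meaning-preserving rewrites must not flip any verdict."""
        flips, total = 0, 0
        for traj in trajectories:
            baseline = judge(base_rubric, traj)
            for rewrite in equivalent_rewrites:
                total += 1
                if judge(rewrite, traj) != baseline:
                    flips += 1
        return flips / total if total else 0.0

Any non-trivial flip rate under such a check signals sensitivity to rubric phrasing rather than to the agent's behavior.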

Key facts

  • arXiv preprint 2605.06161 proposes policy invariance for LLM safety judges
  • Policy invariance has three principles: rubric-semantics invariance, rubric-threshold invariance, and ambiguity-aware calibration (a threshold monotonicity check is sketched after this list)
  • Tested on four agent-class judges using ASSEBench and R-Judge trajectories
  • Found judges respond to both meaningful normative shifts and meaningless structural rewrites
  • Existing benchmarks treat LLM verdicts as ground-truth proxies without checking policy dependence
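
The second principle can be probed with a monotonicity check over a ladder of rubrics ordered from strict to lenient: once a trajectory is judged safe under a stricter rubric, a stable judge should not flag it as unsafe again under a more lenient one. As with the sketch above, the interface and ordering assumptions here are ours, not the paper's.

    # Hypothetical sketch: count monotonicity violations across rubrics
    # ordered from strictest to most lenient. judge returns True when the
    # trajectory is judged unsafe under the given rubric.
    from typing import Callable, Sequence

    def threshold_violations(
        judge: Callable[[str, str], bool],
        rubrics_strict_to_lenient: Sequence[str],
        trajectories: Sequence[str],
    ) -> int:
        violations = 0
        for traj in trajectories:
            cleared = False  # set once some rubric judges traj safe
            for rubric in rubrics_strict_to_lenient:
                unsafe = judge(rubric, traj)
                if cleared and unsafe:
                    violations += 1  # verdict regressed as policy loosened
                if not unsafe:
                    cleared = True
        return violations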

Entities

Institutions

  • arXiv