MCJudgeBench: Benchmark for Constraint-Level LLM Judge Evaluation
A new benchmark, MCJudgeBench, evaluates LLM judges at the constraint level in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels (yes/partial/no), and controlled perturbations. The protocol tests judge stability across evaluation prompt variants. Proprietary and open-source LLM judges are assessed with correctness and inconsistency metrics that distinguish intrinsic inconsistency (under stochastic decoding) from procedural inconsistency (under prompt/response perturbations). Results show that judge reliability is multi-dimensional: strong overall performance does not guarantee equally robust constraint-level evaluation.
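The benchmark's exact data schema is not given here; the following minimal sketch shows one plausible way to represent such an instance. All field names and the example content are illustrative assumptions, not the paper's actual format.

```python
from dataclasses import dataclass, field
from typing import Literal

# Per-constraint gold labels used by the benchmark.
Label = Literal["yes", "partial", "no"]

@dataclass
class JudgeBenchInstance:
    """One constraint-level judging instance (schema is an assumption)."""
    instruction: str                 # the multi-constraint instruction
    response: str                    # candidate response to be judged
    constraints: list[str]           # explicit list of constraints
    gold_labels: list[Label]         # one gold label per constraint
    perturbations: dict[str, str] = field(default_factory=dict)
    # controlled perturbations of the response, keyed by perturbation type

# Illustrative example (not taken from the benchmark itself).
example = JudgeBenchInstance(
    instruction="Summarize the report in exactly three formal sentences "
                "and mention the year 2024.",
    response="The report covers 2024. It was pretty good.",
    constraints=[
        "Exactly three sentences",
        "Formal register",
        "Mentions the year 2024",
    ],
    gold_labels=["no", "partial", "yes"],
)
```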
Key facts
- MCJudgeBench is a benchmark for constraint-level judge evaluation.
- It focuses on multi-constraint instruction following.
- Each instance includes an instruction, candidate response, constraint list, per-constraint gold labels, and perturbations.
- Evaluation prompt variants test judge stability.
- Both proprietary and open-source LLM judges are evaluated.
- Metrics include correctness and inconsistency (a sketch of both follows this list).
- Intrinsic inconsistency under stochastic decoding is distinguished from procedural inconsistency under prompt/response perturbations.
- Strong overall performance does not guarantee equally robust constraint-level evaluation.
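As a rough illustration of how correctness and inconsistency might be computed over per-constraint verdicts, here is a minimal sketch; the aggregation formulas and helper names (`correctness`, `inconsistency`) are assumptions, not the paper's definitions.

```python
from collections import Counter

def correctness(pred_labels, gold_labels):
    """Fraction of constraints where the judge's label matches the gold label."""
    return sum(p == g for p, g in zip(pred_labels, gold_labels)) / len(gold_labels)

def inconsistency(label_runs):
    """Average disagreement with the majority verdict across repeated runs.

    `label_runs` holds one label list per run. For intrinsic inconsistency,
    runs are repeated stochastic decodes of the same evaluation prompt; for
    procedural inconsistency, runs use different prompt/response perturbations.
    """
    n_constraints = len(label_runs[0])
    score = 0.0
    for i in range(n_constraints):
        verdicts = [run[i] for run in label_runs]
        majority_count = Counter(verdicts).most_common(1)[0][1]
        score += 1 - majority_count / len(verdicts)
    return score / n_constraints

# Three judging runs over the same three constraints: only the second
# constraint's verdict flips, so inconsistency is (1/3) / 3, roughly 0.11.
runs = [
    ["yes", "partial", "no"],
    ["yes", "no", "no"],
    ["yes", "partial", "no"],
]
print(correctness(runs[0], ["yes", "partial", "no"]))  # 1.0
print(inconsistency(runs))                             # ~0.11
```

Separating the two inconsistency types this way mirrors the benchmark's distinction: the same aggregation can be applied to repeated decodes (intrinsic) or to perturbed prompts and responses (procedural).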