ARTFEED — Contemporary Art Intelligence

MCJudgeBench: Benchmark for Constraint-Level LLM Judge Evaluation

ai-technology · 2026-05-07

A new benchmark, MCJudgeBench, evaluates LLM judges at the constraint level in multi-constraint instruction following. Each instance comprises an instruction, a candidate response, an explicit constraint list, per-constraint gold labels (yes/partial/no), and controlled perturbations. The protocol tests judge stability across evaluation-prompt variants. Proprietary and open-source LLM judges are assessed with correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt/response perturbations. Results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally robust constraint-level evaluation.
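To make the instance structure concrete, below is a minimal sketch of how such a constraint-level instance might be represented in code. The class and field names are illustrative assumptions for this article, not the benchmark's published schema.

```python
from dataclasses import dataclass, field
from typing import Literal

# Per-constraint gold labels as described in the summary: yes / partial / no.
ConstraintLabel = Literal["yes", "partial", "no"]

@dataclass
class JudgeBenchInstance:
    """Hypothetical representation of one constraint-level evaluation instance."""
    instruction: str                    # multi-constraint instruction given to the model
    response: str                       # candidate response to be judged
    constraints: list[str]              # explicit list of individual constraints
    gold_labels: list[ConstraintLabel]  # one gold label per constraint
    perturbations: list[str] = field(default_factory=list)  # controlled perturbed variants
```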

Key facts

  • MCJudgeBench is a benchmark for constraint-level judge evaluation.
  • It focuses on multi-constraint instruction following.
  • Each instance includes an instruction, candidate response, constraint list, per-constraint gold labels, and perturbations.
  • Evaluation prompt variants test judge stability.
  • Both proprietary and open-source LLM judges are evaluated.
  • Metrics include correctness and inconsistency (see the sketch after this list).
  • Intrinsic inconsistency under stochastic decoding is distinguished from procedural inconsistency under prompt/response perturbations.
  • Strong overall performance does not guarantee equally robust constraint-level evaluation.
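The sketch below illustrates one plausible way the two metric families could be computed; the exact definitions and function names are assumptions, not the benchmark's specification. Correctness compares per-constraint judgments against gold labels, while the same disagreement measure can estimate intrinsic inconsistency (repeated runs with different decoding seeds) or procedural inconsistency (runs under prompt/response perturbations), depending on which runs are fed in.

```python
from collections import Counter

def correctness(pred_labels: list[str], gold_labels: list[str]) -> float:
    """Fraction of per-constraint judgments that match the gold labels."""
    assert len(pred_labels) == len(gold_labels)
    return sum(p == g for p, g in zip(pred_labels, gold_labels)) / len(gold_labels)

def inconsistency(label_runs: list[list[str]]) -> float:
    """Average disagreement rate across repeated judgments of the same constraints.

    Feeding runs that differ only in decoding seed estimates intrinsic
    inconsistency; feeding runs produced under prompt/response perturbations
    estimates procedural inconsistency. (These definitions are assumptions.)
    """
    n_constraints = len(label_runs[0])
    disagreement = 0.0
    for i in range(n_constraints):
        labels_at_i = [run[i] for run in label_runs]
        majority_count = Counter(labels_at_i).most_common(1)[0][1]
        disagreement += 1 - majority_count / len(labels_at_i)
    return disagreement / n_constraints
```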

Entities

Institutions

  • arXiv
