MCJudgeBench: Benchmark for Constraint-Level LLM Judge Evaluation
A new benchmark, MCJudgeBench, evaluates LLM judges at the constraint level in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels (yes/partial/no), and controlled perturbations. The protocol tests judge stability across evaluation prompt variants. Proprietary and open-source LLM judges are assessed with correctness and inconsistency metrics that distinguish intrinsic inconsistency (under stochastic decoding) from procedural inconsistency (under prompt/response perturbations). Results show that judge reliability is multi-dimensional: strong overall performance does not guarantee equally robust constraint-level evaluation.
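The benchmark's exact data schema is not given here; the following minimal sketch shows one plausible way to represent such an instance. All field names and the example content are illustrative assumptions, not the paper's actual format.

```python
from dataclasses import dataclass, field
from typing import Literal

# Per-constraint gold labels used by the benchmark.
Label = Literal["yes", "partial", "no"]

@dataclass
class JudgeBenchInstance:
    """One constraint-level judging instance (schema is an assumption)."""
    instruction: str                 # the multi-constraint instruction
    response: str                    # candidate response to be judged
    constraints: list[str]           # explicit list of constraints
    gold_labels: list[Label]         # one gold label per constraint
    perturbations: dict[str, str] = field(default_factory=dict)
    # controlled perturbations of the response, keyed by perturbation type

# Illustrative example (not taken from the benchmark itself).
example = JudgeBenchInstance(
    instruction="Summarize the report in exactly three formal sentences "
                "and mention the year 2024.",
    response="The report covers 2024. It was pretty good.",
    constraints=[
        "Exactly three sentences",
        "Formal register",
        "Mentions the year 2024",
    ],
    gold_labels=["no", "partial", "yes"],
)
```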
Key facts
- MCJudgeBench is a benchmark for constraint-level judge evaluation.
- It focuses on multi-constraint instruction following.
- Each instance includes an instruction, candidate response, constraint list, per-constraint gold labels, and perturbations.
- Evaluation prompt variants test judge stability.
- Both proprietary and open-source LLM judges are evaluated.
- Metrics include correctness and inconsistency (a sketch of both follows this list).
- Intrinsic inconsistency under stochastic decoding is distinguished from procedural inconsistency under prompt/response perturbations.
- Strong overall performance does not guarantee equally robust constraint-level evaluation.
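As a rough illustration of how correctness and inconsistency might be computed over per-constraint verdicts, here is a minimal sketch; the aggregation formulas and helper names (`correctness`, `inconsistency`) are assumptions, not the paper's definitions.

```python
from collections import Counter

def correctness(pred_labels, gold_labels):
    """Fraction of constraints where the judge's label matches the gold label."""
    return sum(p == g for p, g in zip(pred_labels, gold_labels)) / len(gold_labels)

def inconsistency(label_runs):
    """Average disagreement with the majority verdict across repeated runs.

    `label_runs` holds one label list per run. For intrinsic inconsistency,
    runs are repeated stochastic decodes of the same evaluation prompt; for
    procedural inconsistency, runs use different prompt/response perturbations.
    """
    n_constraints = len(label_runs[0])
    score = 0.0
    for i in range(n_constraints):
        verdicts = [run[i] for run in label_runs]
        majority_count = Counter(verdicts).most_common(1)[0][1]
        score += 1 - majority_count / len(verdicts)
    return score / n_constraints

# Three judging runs over the same three constraints: only the second
# constraint's verdict flips, so inconsistency is (1/3) / 3, roughly 0.11.
runs = [
    ["yes", "partial", "no"],
    ["yes", "no", "no"],
    ["yes", "partial", "no"],
]
print(correctness(runs[0], ["yes", "partial", "no"]))  # 1.0
print(inconsistency(runs))                             # ~0.11
```

Separating the two inconsistency types this way mirrors the benchmark's distinction: the same aggregation can be applied to repeated decodes (intrinsic) or to perturbed prompts and responses (procedural).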