PCFJudge Reduces Order Sensitivity in LLM Factuality Evaluation
PCFJudge is a new inference-time method that addresses candidate-order sensitivity when large language models (LLMs) perform listwise factuality evaluation. It reruns the same factuality-first prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a consensus decision. On RewardBench 2 Factuality, PCFJudge improves accuracy by up to 7 absolute points over direct judging. Development ablations indicate that permutation consensus itself, rather than heavier arbitration layers, drives the improvement. The results suggest that a significant portion of factuality-judging error stems from order instability rather than from the judge's underlying ability to assess factuality.
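The permutation-consensus idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Judge` callable, the function name `permutation_consensus`, the mean-score and mean-rank aggregation, and the use of top-pick disagreement as an uncertainty signal are all assumptions made for illustration, since the summary does not specify the aggregation rule.

```python
import itertools
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Hypothetical judge interface (an assumption, not from the paper): given the
# candidates in some order, return one factuality score per position, where
# a higher score means "more factual". In practice this would wrap an LLM call.
Judge = Callable[[Sequence[str]], List[float]]


def permutation_consensus(
    candidates: Sequence[str],
    judge: Judge,
    n_orderings: int = 4,
    seed: int = 0,
) -> Dict[str, object]:
    """Run the same judge over several candidate orderings and aggregate.

    Assumes at least one candidate and a small candidate set, because all
    permutations are enumerated before sampling.
    """
    n = len(candidates)
    rng = random.Random(seed)
    all_orders = list(itertools.permutations(range(n)))
    orders = rng.sample(all_orders, min(n_orderings, len(all_orders)))

    scores: List[List[float]] = [[] for _ in range(n)]
    ranks: List[List[int]] = [[] for _ in range(n)]
    top_picks: List[int] = []

    for order in orders:
        # Present the candidates in this ordering and score each position.
        pos_scores = judge([candidates[i] for i in order])
        # Positions sorted from best to worst score within this ordering.
        by_score = sorted(range(n), key=lambda p: -pos_scores[p])
        top_picks.append(order[by_score[0]])
        # Map position-level results back to the original candidate indices.
        for rank, pos in enumerate(by_score, start=1):
            scores[order[pos]].append(pos_scores[pos])
            ranks[order[pos]].append(rank)

    mean_score = [mean(s) for s in scores]
    mean_rank = [mean(r) for r in ranks]
    # Assumed consensus rule: highest mean score, ties broken by lower mean rank.
    winner = max(range(n), key=lambda i: (mean_score[i], -mean_rank[i]))
    # Assumed uncertainty signal: share of orderings whose top pick differs.
    disagreement = sum(t != winner for t in top_picks) / len(top_picks)

    return {
        "winner": winner,
        "mean_score": mean_score,
        "mean_rank": mean_rank,
        "disagreement": disagreement,
    }
```

Direct judging corresponds to a single call to `judge` on one fixed ordering. The sketch differs only in repeating that call across orderings and averaging, so any position bias in the judge is spread across candidates instead of favoring whichever one happens to appear first.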
Key facts
- PCFJudge is an inference-time method for LLM factuality evaluation.
- It reruns the same prompt over multiple orderings of candidate sets.
- It aggregates scores, ranks, and uncertainty signals into a consensus decision.
- Improves over direct judging by up to 7 absolute points on RewardBench 2 Factuality.
- Permutation consensus is the dominant source of gain.
- Order instability contributes meaningfully to factuality-judging error.
- The method addresses candidate-order sensitivity in listwise evaluation.
- The research is published on arXiv with ID 2603.20562.