PCFJudge Reduces Order Sensitivity in LLM Factuality Evaluation

ai-technology · 2026-05-07

A new method called PCFJudge addresses candidate-order sensitivity in listwise factuality evaluation by large language models (LLMs). At inference time, the technique reruns the same factuality-first prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, PCFJudge improves accuracy by up to 7 absolute points over direct judging. Development ablations indicate that permutation consensus itself, rather than heavier arbitration layers, drives the improvement. The research suggests that a significant portion of factuality-judging error stems from order instability rather than from the judge's underlying factual knowledge alone.
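The permutation-consensus idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`permutation_consensus`, `toy_judge`), the aggregation rule (mean score with mean rank as a tiebreak), and the agreement signal are all assumptions chosen for illustration, and the toy judge stands in for a real LLM call.

```python
import itertools
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical judge interface: takes candidates in display order and
# returns one factuality score per displayed position.
Judge = Callable[[Sequence[str]], Sequence[float]]


def permutation_consensus(
    candidates: Sequence[str],
    judge: Judge,
    max_orderings: int = 6,
    seed: int = 0,
) -> Tuple[int, List[float], List[float], float]:
    """Rerun `judge` over several orderings and aggregate per candidate.

    Returns (winner_index, mean_scores, mean_ranks, agreement), where
    agreement is the share of orderings whose top pick matches the winner.
    """
    n = len(candidates)
    # Enumerating all permutations is only practical for small candidate sets.
    orderings = list(itertools.permutations(range(n)))
    if len(orderings) > max_orderings:
        orderings = random.Random(seed).sample(orderings, max_orderings)

    scores: List[List[float]] = [[] for _ in range(n)]
    ranks: List[List[int]] = [[] for _ in range(n)]
    top_votes: List[int] = []

    for order in orderings:
        shown = [candidates[i] for i in order]
        out = judge(shown)
        # Positions sorted from highest to lowest score in this ordering.
        by_pos = sorted(range(n), key=lambda p: -out[p])
        for rank, pos in enumerate(by_pos, start=1):
            ranks[order[pos]].append(rank)
        for pos, score in enumerate(out):
            scores[order[pos]].append(score)
        top_votes.append(order[by_pos[0]])

    mean_scores = [mean(s) for s in scores]
    mean_ranks = [mean(r) for r in ranks]
    # Assumed consensus rule: highest mean score, then lowest mean rank.
    winner = max(range(n), key=lambda i: (mean_scores[i], -mean_ranks[i]))
    agreement = top_votes.count(winner) / len(top_votes)
    return winner, mean_scores, mean_ranks, agreement


# Toy stand-in for an LLM judge: a fixed "true" quality per candidate plus
# a bonus for whichever candidate is displayed first (position bias).
TOY_QUALITY = {"a": 0.5, "b": 0.6, "c": 0.1}


def toy_judge(shown: Sequence[str]) -> List[float]:
    return [TOY_QUALITY[c] + (0.3 if pos == 0 else 0.0)
            for pos, c in enumerate(shown)]
```

With the toy judge, a single pass over the order `["a", "b", "c"]` scores "a" highest because of the first-position bonus, while `permutation_consensus(["a", "b", "c"], toy_judge)` averages the bonus out across orderings and selects "b". The returned agreement value is one simple way to surface order instability as an uncertainty signal.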

Key facts

PCFJudge is an inference-time method for LLM factuality evaluation.
It reruns the same prompt over multiple orderings of candidate sets.
It aggregates scores, ranks, and uncertainty signals into a consensus decision.
It improves over direct judging by up to 7 absolute points on RewardBench 2 Factuality.
Permutation consensus is the dominant source of gain.
Order instability contributes meaningfully to factuality-judging error.
The method addresses candidate-order sensitivity in listwise evaluation.
The research is published on arXiv with ID 2603.20562.
