ARTFEED — Contemporary Art Intelligence

PCFJudge Reduces Order Sensitivity in LLM Factuality Evaluation

ai-technology · 2026-05-07

A new method called PCFJudge addresses candidate-order sensitivity in listwise factuality evaluation by large language models (LLMs). The technique reruns the same factuality-first prompt over multiple orderings of candidate sets and aggregates scores, ranks, and uncertainty signals into a consensus decision. On RewardBench 2 Factuality, PCFJudge improves accuracy by up to 7 absolute points over direct judging. Development ablations reveal that permutation consensus itself, rather than heavier arbitration layers, drives the improvement. The research suggests that a significant portion of factuality-judging error stems from order instability.
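The permutation-consensus idea described above can be illustrated with a minimal Python sketch. It is not the PCFJudge implementation: the `judge` callable, the use of cyclic rotations as the set of orderings, the mean-score aggregation with a mean-rank tie-break, and the "disagreement" uncertainty signal are all assumptions made for illustration. The toy judge stands in for an LLM call and simply adds a bonus to whichever candidate is listed first.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Hypothetical judge interface (an assumption, not the paper's API): takes the
# candidates in presentation order and returns one score per candidate, in that order.
Judge = Callable[[Sequence[str]], List[float]]


def permutation_consensus(candidates: Sequence[str], judge: Judge) -> Dict[str, object]:
    """Run `judge` over every cyclic rotation of `candidates` and aggregate."""
    n = len(candidates)
    scores: List[List[float]] = [[] for _ in range(n)]  # scores per original index
    ranks: List[List[int]] = [[] for _ in range(n)]     # ranks per original index (1 = best)
    top_votes = [0] * n                                  # times each candidate ranked first

    # Assumed ordering scheme: cyclic rotations, so each candidate sits in each slot once.
    orderings = [[(start + k) % n for k in range(n)] for start in range(n)]

    for order in orderings:
        run_scores = judge([candidates[i] for i in order])
        # Map position-indexed scores back to original candidate indices.
        by_candidate = {order[pos]: s for pos, s in enumerate(run_scores)}
        ranked = sorted(by_candidate, key=lambda i: -by_candidate[i])
        for rank, i in enumerate(ranked, start=1):
            scores[i].append(by_candidate[i])
            ranks[i].append(rank)
        top_votes[ranked[0]] += 1

    mean_scores = [mean(s) for s in scores]
    mean_ranks = [mean(r) for r in ranks]
    # Consensus pick: highest mean score, ties broken by lower mean rank.
    winner = max(range(n), key=lambda i: (mean_scores[i], -mean_ranks[i]))
    # Illustrative uncertainty signal: share of orderings whose top pick differed.
    disagreement = 1.0 - top_votes[winner] / len(orderings)
    return {
        "winner": winner,
        "mean_scores": mean_scores,
        "mean_ranks": mean_ranks,
        "disagreement": disagreement,
    }


def biased_toy_judge(ordered: Sequence[str]) -> List[float]:
    """Toy stand-in for an LLM judge: fixed quality plus a first-position bonus."""
    quality = {"A": 0.6, "B": 0.7, "C": 0.2}
    return [quality[c] + (0.2 if pos == 0 else 0.0) for pos, c in enumerate(ordered)]


candidates = ["A", "B", "C"]

# Direct judging: a single pass in the given order; the position bonus favours "A".
direct_scores = biased_toy_judge(candidates)
direct_pick = max(range(len(candidates)), key=lambda i: direct_scores[i])

# Consensus judging: the bonus is spread evenly across candidates and "B" wins.
result = permutation_consensus(candidates, biased_toy_judge)
print("direct pick:", candidates[direct_pick])
print("consensus pick:", candidates[result["winner"]])
print("disagreement:", round(result["disagreement"], 3))
```

In this toy setup a single pass picks "A" only because it is listed first, while the consensus over rotations recovers "B", the candidate with the higher underlying quality. A nonzero disagreement value flags that the top pick changed with ordering, which is the kind of instability the method is meant to expose.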

Key facts

  • PCFJudge is an inference-time method for LLM factuality evaluation.
  • It reruns the same prompt over multiple orderings of candidate sets.
  • It aggregates scores, ranks, and uncertainty signals into a consensus decision.
  • Improves over direct judging by up to 7 absolute points on RewardBench 2 Factuality.
  • Permutation consensus is the dominant source of gain.
  • Order instability contributes meaningfully to factuality-judging error.
  • The method addresses candidate-order sensitivity in listwise evaluation.
  • The research is published on arXiv with ID 2603.20562.

Entities

Institutions

  • arXiv
