PiCSAR: Probabilistic Confidence Selection for LLM Reasoning

ai-technology · 2026-05-01

Researchers have introduced PiCSAR (Probabilistic Confidence Selection And Ranking), a training-free method for scoring candidate outputs from large language models (LLMs) and large reasoning models (LRMs). It scores each candidate by the joint log-likelihood of its reasoning chain and final answer, and this score decomposes into two terms: reasoning confidence and answer confidence. PiCSAR reports gains of +10.18 on MATH500 and +9.81 on AIME2025, and it outperforms baselines with at least 2x fewer samples in 16 out of 20 comparisons. The analysis also finds that correct reasoning chains have significantly higher joint log-likelihood than incorrect ones.
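
The split into reasoning confidence and answer confidence is consistent with the chain rule of probability. As a sketch, writing x for the prompt, r for the reasoning chain, and y for the final answer (these symbols are assumed for illustration and do not come from the source), the joint log-likelihood factors as follows, with the first term presumably playing the role of reasoning confidence and the second the role of answer confidence:

```latex
\log p(r, y \mid x)
  = \underbrace{\log p(r \mid x)}_{\text{reasoning confidence}}
  + \underbrace{\log p(y \mid r, x)}_{\text{answer confidence}}
```

Under this reading, selecting among n sampled candidates amounts to choosing the pair (r, y) that maximizes the sum.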

Key facts

PiCSAR is a training-free method for scoring reasoning chains.
Scores each candidate by the joint log-likelihood of its reasoning and final answer.
Gains +10.18 on the MATH500 benchmark.
Gains +9.81 on the AIME2025 benchmark.
Outperforms baselines with at least 2x fewer samples in 16/20 comparisons.
The score decomposes into reasoning confidence and answer confidence.
Improves best-of-n sampling for LLMs and LRMs.
No ground-truth answers required for scoring.
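
The facts above describe a best-of-n selection rule, which can be illustrated with a minimal Python sketch. It assumes each sampled candidate already carries per-token log-probabilities for its reasoning chain and for its final answer. The `Candidate` class, its field names, and the helper functions are hypothetical, and the sketch does not address whether the original method applies length normalization or computes answer confidence with a separate model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One sampled output (hypothetical container, not from the source)."""
    answer: str
    reasoning_logprobs: List[float]  # per-token log-probs of the reasoning chain
    answer_logprobs: List[float]     # per-token log-probs of the answer given the reasoning


def joint_log_likelihood(c: Candidate) -> float:
    # Reasoning confidence plus answer confidence, both taken as summed token log-probs.
    reasoning_confidence = sum(c.reasoning_logprobs)
    answer_confidence = sum(c.answer_logprobs)
    return reasoning_confidence + answer_confidence


def select_best(candidates: List[Candidate]) -> Candidate:
    # Best-of-n: keep the candidate with the highest joint log-likelihood.
    # No ground-truth answer is consulted. Raises ValueError on an empty list.
    return max(candidates, key=joint_log_likelihood)


# Illustrative values only; these numbers are made up for the example.
candidates = [
    Candidate(answer="42", reasoning_logprobs=[-0.5, -0.7], answer_logprobs=[-0.1]),
    Candidate(answer="41", reasoning_logprobs=[-1.2, -1.5], answer_logprobs=[-0.9]),
]
best = select_best(candidates)
print(best.answer)  # prints 42
```

Because the score depends only on the model's own token probabilities, this kind of selection needs neither a trained verifier nor reference answers, which matches the training-free, no-ground-truth framing above.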

Entities

—

Sources

arXiv cs.AI — 2026-05-01