PiCSAR: Probabilistic Confidence Selection for LLM Reasoning
Researchers have introduced PiCSAR (Probabilistic Confidence Selection And Ranking), a training-free method that scores outputs from large language models (LLMs) and large reasoning models (LRMs) by the joint log-likelihood of the reasoning chain and the final answer. The score decomposes into two parts: reasoning confidence and answer confidence. PiCSAR is reported to achieve gains of +10.18 on MATH500 and +9.81 on AIME2025, and to outperform baselines with at least 2x fewer samples in 16 out of 20 comparisons. The analysis also indicates that correct reasoning chains exhibit significantly higher joint log-likelihood than incorrect ones.
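The two-part split follows from the chain rule of probability. The sketch below uses notation assumed here rather than taken from the source: x is the prompt, r the reasoning chain, and y the final answer. Mapping the first term to "reasoning confidence" and the second to "answer confidence" is likewise an assumption based on the summary above.

```latex
\log p(r, y \mid x)
  = \underbrace{\log p(r \mid x)}_{\text{reasoning confidence (assumed)}}
  + \underbrace{\log p(y \mid r, x)}_{\text{answer confidence (assumed)}}
```

Under this reading, a candidate scores highly only if the model finds both the reasoning and the answer that follows from it likely.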
Key facts
- PiCSAR is a training-free method for scoring reasoning chains.
- It uses joint log-likelihood of reasoning and answer.
- Achieves +10.18 on MATH500 benchmark.
- Achieves +9.81 on AIME2025 benchmark.
- Outperforms baselines with at least 2x fewer samples in 16/20 comparisons.
- Decomposes into reasoning confidence and answer confidence.
- Improves best-of-n sampling for LLMs and LRMs.
- No ground-truth answers required for scoring.
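The facts above suggest a simple best-of-n selection loop, sketched below in Python. This is a minimal illustration, not the authors' implementation: the `Candidate` class, the function names, and the choice to sum raw token log-probabilities without length normalisation are all assumptions. It presumes the per-token log-probabilities for each sampled candidate have already been obtained from the model.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    """One sampled output (hypothetical container, not from the source)."""
    answer: str
    reasoning_logprobs: Sequence[float]  # token log-probs of the reasoning chain
    answer_logprobs: Sequence[float]     # token log-probs of the final answer


def reasoning_confidence(c: Candidate) -> float:
    # Log-likelihood of the reasoning chain given the prompt.
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    # Log-likelihood of the answer given the prompt and the reasoning.
    return sum(c.answer_logprobs)


def joint_score(c: Candidate) -> float:
    # Joint log-likelihood is the sum of the two terms (chain rule).
    # Assumption: no length normalisation is applied.
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: Sequence[Candidate]) -> Candidate:
    # Best-of-n: keep the candidate with the highest joint score.
    # No ground-truth answer is consulted at any point.
    return max(candidates, key=joint_score)


if __name__ == "__main__":
    samples = [
        Candidate("42", [-0.5, -0.25], [-0.25]),
        Candidate("41", [-1.0, -1.0], [-0.5]),
    ]
    print(select_best(samples).answer)
```

Because the score comes only from the model's own token probabilities, the same function can rank candidates on any prompt without labels or a trained reward model.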