ARTFEED — Contemporary Art Intelligence

CASPO Framework Boosts Reasoning Reliability in LLMs

ai-technology · 2026-05-11

Researchers have introduced CASPO (Confidence-Aware Step-wise Preference Optimization), a framework that improves the reliability of large reasoning models by aligning token-level confidence with the logical correctness of each reasoning step. Described in arXiv:2605.07353, CASPO trains with iterative Direct Preference Optimization and requires no external reward model or verifiers. A companion inference-time technique, Confidence-aware Thought (CaT), prunes uncertain reasoning branches with minimal added latency. Across ten benchmarks and several model families, the approach improves both reasoning reliability and inference speed. CASPO scales to Qwen3-8B-Base and surpasses tree-search baselines on AIME'24 and AIME'25 without extra sampling.
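The core idea of aligning token-level confidence with step-wise correctness can be sketched as follows. This is a minimal illustration, not the paper's method: the confidence measure (mean token log-probability per step) and the simplified DPO-style loss on a (correct, incorrect) step pair are assumptions, and real DPO operates on policy-vs-reference log-ratios rather than raw confidences.

```python
import math

def step_confidence(token_logprobs):
    """Mean token log-probability of one reasoning step (higher = more confident).

    Illustrative stand-in for whatever confidence signal CASPO actually uses.
    """
    return sum(token_logprobs) / len(token_logprobs)

def dpo_step_loss(conf_chosen, conf_rejected, beta=0.1):
    """Simplified DPO-style preference loss on a step pair.

    -log sigmoid(beta * (conf_chosen - conf_rejected)): minimizing this widens
    the confidence margin between correct ("chosen") and incorrect ("rejected")
    steps, so confidence comes to track logical correctness.
    """
    margin = beta * (conf_chosen - conf_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy pair: a confidently decoded correct step vs. a hesitant wrong step.
correct = step_confidence([-0.1, -0.2, -0.05])   # ~ -0.12
wrong = step_confidence([-1.5, -2.0, -0.9])      # ~ -1.47
loss = dpo_step_loss(correct, wrong)             # small, since the margin is already positive
```

Because no separate reward model is involved, the preference signal here comes entirely from pairing steps by their correctness labels, consistent with the paper's claim of needing no external verifiers.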

Key facts

  • CASPO aligns token-level confidence with step-wise logical correctness.
  • Uses iterative Direct Preference Optimization without a separate reward model.
  • CaT prunes uncertain reasoning branches during inference with O(V) latency.
  • Tested across ten benchmarks and multiple model families.
  • Scales to Qwen3-8B-Base.
  • Surpasses tree-search baselines on AIME'24 and AIME'25.
  • No external verifiers or massive sampling required.
  • Published on arXiv with ID 2605.07353.
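The CaT pruning step in the list above can be sketched as a simple filter over candidate reasoning branches. The threshold value, the fallback rule, and the tuple representation are illustrative assumptions, not details from the paper:

```python
def prune_branches(branches, threshold=-0.5):
    """Keep branches whose step confidence clears the threshold.

    branches: list of (text, step_confidence) pairs. If every branch falls
    below the threshold, keep the single most confident one so decoding can
    continue. A hypothetical sketch of CaT-style inference-time pruning.
    """
    kept = [b for b in branches if b[1] >= threshold]
    return kept if kept else [max(branches, key=lambda b: b[1])]

# Three candidate next steps; the low-confidence one is dropped before
# any further decoding is spent on it.
candidates = [("step A", -0.2), ("step B", -1.3), ("step C", -0.4)]
survivors = prune_branches(candidates)
# survivors: [("step A", -0.2), ("step C", -0.4)]
```

Dropping branches this early is what keeps the added latency minimal: unlike tree search, no extra samples are generated for paths the model is already unsure about.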

Entities

Institutions

  • arXiv

Sources