First-Token Confidence Matches Semantic Self-Consistency for Hallucination Detection

other · 2026-05-07

The phi_first technique detects hallucinations in large language models by measuring the normalized entropy of the top-K logits at the first content-bearing token of a single greedy decode. It matches or exceeds semantic self-consistency, which requires multiple sampled generations and external inference. Across three 7-8B instruction-tuned models and two closed-book factual QA benchmarks, phi_first reached a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. Because it avoids repeated decoding, the approach is computationally cheap, and its scores show a moderate to strong correlation with semantic agreement.
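The normalized top-K entropy described above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names, the default K of 10, and the choice to report confidence as one minus the normalized entropy are all assumptions, and locating the first content-bearing token in the decode is left out.

```python
import math

def normalized_topk_entropy(logits, k=10):
    """Entropy of the softmax over the k largest logits, scaled to [0, 1].

    Hypothetical helper: k=10 is an assumed default, not a value from the paper.
    """
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        return 0.0  # a single candidate carries no uncertainty
    # Numerically stable softmax restricted to the top-k logits.
    m = max(top)
    exps = [math.exp(x - m) for x in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Dividing by log(k) maps a uniform distribution to 1.0.
    return entropy / math.log(len(top))

def first_token_confidence(logits, k=10):
    """Assumed convention: high confidence means low normalized entropy."""
    return 1.0 - normalized_topk_entropy(logits, k)
```

In use, the logits passed in would be the model's output at the first content-bearing position of one greedy decode. A sharply peaked distribution gives a confidence near 1, and a flat one gives a confidence near 0.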

Key facts

phi_first uses first-token confidence from a single greedy decode.
It matches or exceeds semantic self-consistency on closed-book factual QA.
Mean AUROC is 0.820, vs 0.793 for semantic agreement and 0.791 for surface-form self-consistency.
Tested on three 7-8B instruction-tuned models and two benchmarks.
Method is computationally efficient, avoiding repeated decoding.
Correlation with semantic agreement is moderate to strong.
Published on arXiv with ID 2605.05166.
Method uses normalized entropy of top-K logits.
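
The AUROC figures above measure how well a score ranks hallucinated answers above correct ones. A minimal sketch of that metric, assuming each answer has an uncertainty score and a binary hallucination label, is shown below. The function name and the toy inputs in the usage note are hypothetical and are not data from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    scores: higher means more likely hallucinated (assumed convention).
    labels: 1 for hallucinated, 0 for correct. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For example, calling auroc([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 1]) on these made-up values compares four positive-negative pairs. A value of 0.5 corresponds to chance ranking and 1.0 to perfect separation.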
