First-Token Confidence Matches Semantic Self-Consistency for Hallucination Detection
The phi_first technique detects hallucinations in large language models by computing the normalized entropy of the top-K logits at the first content-bearing token of a single greedy decode. It matches or exceeds semantic self-consistency, which requires multiple sampled generations and external inference. Across three 7-8B instruction-tuned models and two benchmarks, phi_first reached a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. Because it avoids repeated decoding, the approach is computationally cheap, and its scores correlate moderately to strongly with semantic agreement.
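As a rough illustration of the quantity described above, the sketch below computes a normalized top-K entropy from a list of raw logits. Several details are assumptions, not taken from the source: the function names, the default K of 10, renormalizing with a softmax over only the top-K logits, and reporting confidence as one minus the normalized entropy. Locating the first content-bearing token is left out.

```python
import math


def normalized_topk_entropy(logits, k=10):
    """Entropy of the softmax over the k largest logits, divided by log(k).

    Returns a value in [0, 1]: 0 means all mass on one token,
    1 means a uniform distribution over the top-k tokens.
    Assumption: probabilities are renormalized over the top-k logits only.
    """
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        raise ValueError("need at least two logits to normalize entropy")
    # Subtract the max logit before exponentiating for numerical stability.
    peak = top[0]
    exps = [math.exp(x - peak) for x in top]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(top))


def first_token_confidence(logits, k=10):
    """Hypothetical confidence score: 1 - normalized top-k entropy.

    Assumption: higher values mean the model is more certain at the
    first content-bearing token, so the answer is less likely hallucinated.
    """
    return 1.0 - normalized_topk_entropy(logits, k)
```

In this sketch, `logits` would be the model's output scores at the first content-bearing position of the greedy decode; a flat distribution gives a confidence near 0 and a sharply peaked one gives a confidence near 1.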
Key facts
- phi_first uses first-token confidence from a single greedy decode.
- It matches or exceeds semantic self-consistency on closed-book factual QA.
- Mean AUROC is 0.820 for phi_first, versus 0.793 for semantic agreement and 0.791 for surface-form self-consistency.
- Tested on three 7-8B instruction-tuned models and two benchmarks.
- Method is computationally efficient, avoiding repeated decoding.
- Correlation with semantic agreement is moderate to strong.
- Published on arXiv with ID 2605.05166.
- Method uses normalized entropy of top-K logits.
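The AUROC figures in the list above measure how well a score separates correct answers from hallucinated ones. The sketch below shows the standard pairwise definition of AUROC in plain Python. The function name and the label convention (1 for a correct answer, 0 for a hallucination) are assumptions made for illustration, and this is not the evaluation code behind the reported numbers.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    (label 1, assumed here to mean a correct answer) receives a higher
    score than a randomly chosen negative example (label 0), with ties
    counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to a score that ranks correct and hallucinated answers no better than chance, and 1.0 to perfect separation. The quadratic loop is fine for small evaluation sets; a rank-based formulation would scale better.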
Entities
Institutions
- arXiv