ARTFEED — Contemporary Art Intelligence

First-Token Confidence Matches Semantic Self-Consistency for Hallucination Detection

other · 2026-05-07

The phi_first technique detects hallucinations in large language models by computing the normalized entropy of the top-K logits at the first content-bearing token of a single greedy decode. It matches or exceeds semantic self-consistency, which requires multiple sampled responses plus external inference. Across three instruction-tuned models of 7-8B parameters and two closed-book factual QA benchmarks, phi_first reached a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. Because it avoids repeated decoding, the approach is computationally cheap, and its scores correlate moderately to strongly with semantic agreement.
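
As a rough illustration of the scoring step described above, the sketch below turns the logits at one token position into a confidence value. Several details are assumptions and not taken from the paper: the function name phi_first, the default k=10, renormalizing the softmax over only the top-k logits, and reporting the score as one minus the normalized entropy so that higher means more confident. Locating the first content-bearing token is left out.

```python
import math


def phi_first(logits, k=10):
    """Confidence in [0, 1] from the top-k logits at a single token position.

    Assumed definition (not confirmed by the source): one minus the entropy
    of the softmax over the top-k logits, divided by its maximum log(k).
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        raise ValueError("need at least two logits")

    # Numerically stable softmax restricted to the top-k logits.
    peak = top[0]
    exps = [math.exp(x - peak) for x in top]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Shannon entropy, skipping zero-probability terms.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)

    # Normalize by the entropy of a uniform distribution over the same entries.
    return 1.0 - entropy / math.log(len(top))


# Hypothetical logits: a peaked distribution scores higher than a flat one.
peaked_score = phi_first([9.0, 1.0, 0.5, 0.2], k=4)
flat_score = phi_first([1.0, 1.0, 1.0, 1.0], k=4)
```

In practice the logits would come from the model's output at the chosen position during greedy decoding; a low score would flag the answer as a likely hallucination.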

Key facts

  • phi_first uses first-token confidence from a single greedy decode.
  • It matches or exceeds semantic self-consistency on closed-book factual QA.
  • Mean AUROC of 0.820, versus 0.793 for semantic agreement and 0.791 for surface-form self-consistency.
  • Tested on three 7-8B instruction-tuned models and two benchmarks.
  • Method is computationally efficient, avoiding repeated decoding.
  • Correlation with semantic agreement is moderate to strong.
  • Published on arXiv with ID 2605.05166.
  • Method uses normalized entropy of top-K logits.
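
For readers unfamiliar with the AUROC figures quoted above, here is a minimal sketch of how such a number can be computed from per-question confidence scores. The label convention (1 for a correct answer, 0 for a hallucinated one) and the toy data are hypothetical and are not results from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are assumed to be 1 (correct answer)
    or 0 (hallucinated answer); higher scores mean more confidence.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores and correctness labels for five questions.
toy_scores = [0.9, 0.4, 0.3, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
toy_auroc = auroc(toy_scores, toy_labels)  # 5 of 6 pairs ranked correctly
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of correct and hallucinated answers.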

Entities

Institutions

  • arXiv

Sources