LLM Confidence Calibration Depends on Measurement Choices

ai-technology · 2026-05-28

A recent study posted on arXiv finds that conclusions about confidence calibration in large language models (LLMs) depend heavily on how token-probability scores are measured when they are compared with verbalized confidence. The study holds verbalized-confidence elicitation fixed (one prompt template, one probability scale, one output format) and varies three measurement choices: which answer string receives the token-probability score, how that score is read from the answer tokens, and which conditioning context is used for the measurement. Across four QA benchmarks and three open 7–8B base/Instruct model families, with larger Qwen2.5 variants serving as same-family robustness checks, the results show that the conditioning context alone can change the sign or magnitude of the gap in Expected Calibration Error (ECE) between the two confidence signals. The findings suggest that LLM confidence comparisons should state their measurement protocol explicitly, because the protocol itself can drive the result.
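For readers unfamiliar with the metric, the following is a minimal sketch of a standard equal-width binned ECE and of an ECE gap between two confidence signals. It is not the paper's code: the function names, the choice of 10 bins, the sign convention (token-probability ECE minus verbalized ECE), and the toy data are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width binned ECE: sum over bins of (n_b / N) * |acc_b - conf_b|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, y in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        b = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the top bin
        conf_sum[b] += c
        hit_sum[b] += 1.0 if y else 0.0
        count[b] += 1
    n = len(confidences)
    # |hit_sum - conf_sum| / n equals (count / n) * |accuracy - mean confidence|
    return sum(abs(hit_sum[b] - conf_sum[b]) for b in range(n_bins) if count[b]) / n


def ece_gap(
    token_prob_conf: Sequence[float],
    verbalized_conf: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Assumed sign convention: token-probability ECE minus verbalized ECE."""
    return expected_calibration_error(
        token_prob_conf, correct, n_bins
    ) - expected_calibration_error(verbalized_conf, correct, n_bins)


# Toy data, invented for illustration only (not results from the study).
correct = [True, True, False, True]
token_prob = [0.95, 0.85, 0.75, 0.65]
verbalized = [0.90, 0.90, 0.90, 0.90]
gap = ece_gap(token_prob, verbalized, correct)
print(f"ECE gap (token-prob minus verbalized): {gap:+.3f}")
```

Under this convention a positive gap means the token-probability signal is the worse calibrated of the two, so a sign flip caused by a measurement choice would reverse which signal appears better.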

Key facts

Study examines LLM confidence calibration by comparing token-probability scores and verbalized confidence.
Verbalized-confidence elicitation is held fixed: one prompt template, probability scale, and output format.
Measurement axes varied: which answer string receives token-probability score, how score is read, and conditioning context.
Evaluated on four QA benchmarks across three open 7–8B base/Instruct model families.
Larger Qwen2.5 variants used as same-family robustness checks.
Conditioning context changes sign or magnitude of ECE gap.
Paper available on arXiv with ID 2605.27752.
Highlights need for explicit measurement choices in confidence calibration research.
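To make the "how score is read" axis concrete, here is a minimal sketch of several common ways to collapse the per-token log-probabilities of an answer into one confidence score. The four methods and their names are illustrative assumptions, not necessarily the variants the paper evaluates, and the log-probabilities are invented.

```python
import math
from typing import Sequence


def read_score(token_logprobs: Sequence[float], method: str = "joint") -> float:
    """Collapse per-token log-probabilities of an answer into one score in [0, 1]."""
    if not token_logprobs:
        raise ValueError("answer must contain at least one token")
    if method == "joint":
        # Product of token probabilities; shrinks as the answer gets longer.
        return math.exp(sum(token_logprobs))
    if method == "mean":
        # Geometric mean of token probabilities (length-normalized).
        return math.exp(sum(token_logprobs) / len(token_logprobs))
    if method == "min":
        # Probability of the least likely token.
        return math.exp(min(token_logprobs))
    if method == "first":
        # Probability of the first answer token only.
        return math.exp(token_logprobs[0])
    raise ValueError(f"unknown method: {method}")


# Invented log-probabilities for a three-token answer.
logprobs = [math.log(0.9), math.log(0.6), math.log(0.8)]
for name in ("joint", "mean", "min", "first"):
    print(name, round(read_score(logprobs, name), 3))
```

The same answer yields a different score under each reading, which is one way a calibration estimate can shift without any change to the model or the prompt.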
