LLM Verbal Confidence Fails Psychometric Validity in 3-9B Models
A study that recently appeared on arXiv (2604.22215) asks whether seven instruction-tuned open-weight language models, spanning four families and 3-9 billion parameters, can produce verbal confidence scores that meet basic validity requirements for item-level Type-2 discrimination. The authors tested 524 TriviaQA items with both numeric (0-100) and categorical (10-class) elicitation, running 8,384 deterministic trials on standard hardware. All seven models failed to produce valid numeric confidence scores, with a mean ceiling rate of 91.7%, and categorical elicitation didn't help: it actually worsened performance for six of the models.
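To make the setup concrete, here is a minimal sketch of numeric (0-100) confidence elicitation with greedy decoding using the Hugging Face transformers API. The model ID, prompt wording, and answer-parsing regex are my own assumptions for illustration; the paper's exact protocol may differ.

```python
# Sketch of numeric (0-100) confidence elicitation with greedy decoding.
# MODEL_ID, the prompt wording, and the parsing regex are illustrative
# assumptions, not the study's exact protocol.
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # hypothetical stand-in for a 3-9B model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

def elicit_numeric_confidence(question: str, answer: str) -> int | None:
    """Ask the model how confident (0-100) it is that `answer` is correct."""
    messages = [{
        "role": "user",
        "content": (
            f"Question: {question}\nProposed answer: {answer}\n"
            "On a scale from 0 to 100, how confident are you that this "
            "answer is correct? Reply with a single integer."
        ),
    }]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    # do_sample=False gives greedy decoding, so each trial is deterministic.
    output = model.generate(input_ids, max_new_tokens=8, do_sample=False)
    reply = tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)
    match = re.search(r"\d{1,3}", reply)
    return int(match.group()) if match else None
```

If the study's mean ceiling rate of 91.7% holds, a call like this returns a top-of-scale score on roughly nine out of ten trials, leaving almost no variance with which to discriminate right from wrong answers.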
Key facts
- Study pre-registered on OSF (osf.io/azbvx)
- Seven instruction-tuned open-weight models tested
- Models from four families, 3-9B parameters
- 524 TriviaQA items used
- Numeric (0-100) and categorical (10-class) elicitation
- Greedy decoding applied
- 8,384 deterministic trials conducted
- All seven models invalid on numeric confidence
- Mean ceiling rate of 91.7% (see the metric sketch after this list)
- Categorical elicitation worsened performance for six of the seven models
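To show how the two headline numbers could be computed from trial records, here is a small sketch. It assumes "ceiling rate" means the share of trials scored at the top of the scale, and scores Type-2 discrimination as the AUROC between confidence and answer correctness, a common choice; the paper may use a different statistic. The toy data is illustrative, not the study's results.

```python
# Ceiling rate and Type-2 discrimination (AUROC) from per-trial records.
# Toy data below is illustrative only.
from sklearn.metrics import roc_auc_score

confidences = [100, 100, 95, 100, 100, 80, 100, 100]  # verbal confidence per trial
correct =     [1,   0,   1,  1,   0,   0,  1,   1]    # 1 = answer was correct

# Share of trials pinned at the top of the 0-100 scale.
ceiling_rate = sum(c == 100 for c in confidences) / len(confidences)
print(f"ceiling rate: {ceiling_rate:.1%}")

# Type-2 AUROC: 0.5 means confidence is useless for telling
# correct answers from incorrect ones.
print(f"type-2 AUROC: {roc_auc_score(correct, confidences):.3f}")
```

With nearly all scores pinned at the ceiling, the AUROC collapses toward chance, which is the kind of validity failure the study reports.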