LLM Verbal Confidence Fails Psychometric Validity in 3-9B Models
A study that recently appeared on arXiv (2604.22215) asks whether seven instruction-tuned open-weight language models, spanning four families and 3-9 billion parameters, can produce verbal confidence scores that meet basic validity requirements for item-level Type-2 discrimination. The authors tested 524 TriviaQA items with both numeric (0-100) and categorical (10-class) elicitation, running 8,384 deterministic trials on standard hardware. All seven models failed to produce valid numeric confidence scores, with a mean ceiling rate of 91.7%, and categorical elicitation didn't help: it actually worsened performance for six of the models.
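To make the setup concrete, here is a minimal sketch of numeric (0-100) confidence elicitation with greedy decoding using the Hugging Face transformers API. The model ID, prompt wording, and answer-parsing regex are my own assumptions for illustration; the paper's exact protocol may differ.

```python
# Sketch of numeric (0-100) confidence elicitation with greedy decoding.
# MODEL_ID, the prompt wording, and the parsing regex are illustrative
# assumptions, not the study's exact protocol.
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # hypothetical stand-in for a 3-9B model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

def elicit_numeric_confidence(question: str, answer: str) -> int | None:
    """Ask the model how confident (0-100) it is that `answer` is correct."""
    messages = [{
        "role": "user",
        "content": (
            f"Question: {question}\nProposed answer: {answer}\n"
            "On a scale from 0 to 100, how confident are you that this "
            "answer is correct? Reply with a single integer."
        ),
    }]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    # do_sample=False gives greedy decoding, so each trial is deterministic.
    output = model.generate(input_ids, max_new_tokens=8, do_sample=False)
    reply = tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)
    match = re.search(r"\d{1,3}", reply)
    return int(match.group()) if match else None
```

If the study's mean ceiling rate of 91.7% holds, a call like this returns a top-of-scale score on roughly nine out of ten trials, leaving almost no variance with which to discriminate right from wrong answers.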
Key facts
- Study pre-registered on OSF (osf.io/azbvx)
- Seven instruction-tuned open-weight models tested
- Models from four families, 3-9B parameters
- 524 TriviaQA items used
- Numeric (0-100) and categorical (10-class) elicitation
- Greedy decoding applied
- 8,384 deterministic trials conducted
- All seven models invalid on numeric confidence
- Mean ceiling rate of 91.7% (see the metric sketch after this list)
- Categorical elicitation worsened performance for six of the seven models
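To show how the two headline numbers could be computed from trial records, here is a small sketch. It assumes "ceiling rate" means the share of trials scored at the top of the scale, and scores Type-2 discrimination as the AUROC between confidence and answer correctness, a common choice; the paper may use a different statistic. The toy data is illustrative, not the study's results.

```python
# Ceiling rate and Type-2 discrimination (AUROC) from per-trial records.
# Toy data below is illustrative only.
from sklearn.metrics import roc_auc_score

confidences = [100, 100, 95, 100, 100, 80, 100, 100]  # verbal confidence per trial
correct =     [1,   0,   1,  1,   0,   0,  1,   1]    # 1 = answer was correct

# Share of trials pinned at the top of the 0-100 scale.
ceiling_rate = sum(c == 100 for c in confidences) / len(confidences)
print(f"ceiling rate: {ceiling_rate:.1%}")

# Type-2 AUROC: 0.5 means confidence is useless for telling
# correct answers from incorrect ones.
print(f"type-2 AUROC: {roc_auc_score(correct, confidences):.3f}")
```

With nearly all scores pinned at the ceiling, the AUROC collapses toward chance, which is the kind of validity failure the study reports.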