ARTFEED — Contemporary Art Intelligence

LLM Verbal Confidence Fails Psychometric Validity in 3-9B Models

ai-technology · 2026-04-27

A study posted to arXiv (2604.22215) asks whether seven instruction-tuned open-weight language models, drawn from four families and spanning 3-9 billion parameters, can produce verbal confidence scores that satisfy basic validity requirements for Type-2 discrimination at the item level. The authors evaluated the models on 524 TriviaQA items, eliciting confidence in both numeric (0-100) and categorical (10-class) formats across 8,384 deterministic trials with greedy decoding on standard hardware. All seven models failed to produce valid numeric confidence scores, with a mean ceiling rate of 91.7%, and categorical elicitation did not help: it worsened performance for six of the seven models.
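
The elicitation setup is easy to sketch. The snippet below shows numeric (0-100) confidence elicitation under greedy decoding via the Hugging Face transformers API; the model name, prompt wording, and answer parsing are illustrative assumptions, not the paper's exact protocol.

import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # hypothetical stand-in for one of the 3-9B models
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def elicit_numeric_confidence(question, answer):
    """Ask the model to rate its confidence in `answer` on a 0-100 scale."""
    prompt = tokenizer.apply_chat_template(
        [{"role": "user",
          "content": (f"Question: {question}\nYour answer: {answer}\n"
                      "On a scale from 0 to 100, how confident are you that "
                      "this answer is correct? Reply with one integer only.")}],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    out = model.generate(prompt, max_new_tokens=8, do_sample=False)  # greedy decoding
    reply = tokenizer.decode(out[0, prompt.shape[1]:], skip_special_tokens=True)
    match = re.search(r"\d{1,3}", reply)
    return int(match.group()) if match else None  # None = unparseable response

A categorical (10-class) variant would replace the integer instruction with a fixed set of confidence labels and parse the chosen label instead.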

Key facts

  • Study pre-registered on OSF (osf.io/azbvx)
  • Seven instruction-tuned open-weight models tested
  • Models from four families, 3-9B parameters
  • 524 TriviaQA items used
  • Numeric (0-100) and categorical (10-class) elicitation
  • Greedy decoding applied
  • 8,384 deterministic trials conducted
  • All seven models invalid on numeric confidence
  • Mean ceiling rate of 91.7% on numeric confidence (see the sketch after this list)
  • Categorical elicitation worsened performance for six of the seven models
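
Two statistics named above can be sketched directly: the ceiling rate and item-level Type-2 discrimination. Scoring discrimination as AUROC of confidence against answer correctness is an assumption made here for illustration; the paper's exact validity criteria are not reproduced in this brief.

from sklearn.metrics import roc_auc_score

def ceiling_rate(confidences, ceiling=100):
    """Share of trials pinned at the top of the scale."""
    return sum(c == ceiling for c in confidences) / len(confidences)

def type2_auroc(correct, confidences):
    """Item-level Type-2 discrimination scored as AUROC:
    0.5 means confidence carries no information about correctness."""
    return roc_auc_score(correct, confidences)

# Toy illustration: confidence stuck at the ceiling regardless of accuracy.
conf = [100, 100, 100, 100, 100, 100]
acc = [1, 0, 1, 1, 0, 0]
print(ceiling_rate(conf))      # 1.0  (everything at ceiling)
print(type2_auroc(acc, conf))  # 0.5  (no discrimination)

A near-ceiling confidence distribution, as in the toy data, leaves AUROC at chance: a constant score cannot separate correct from incorrect answers.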

Entities

Institutions

  • arXiv
  • OSF

Sources