Self-Consistency Distillation Fails on Gemma 3 4B Verbal Confidence
A pre-registered study investigated whether confidence-conditioned supervised fine-tuning (CSFT) with self-consistency targets can improve verbal confidence calibration in smaller instruct-tuned language models. The authors tested this on Gemma 3 4B-it, applying a modal filter that restricted training to items whose modal (most frequent) sampled answer was correct. Under this filter, AUROC2 declined from 0.554 to 0.509, which the authors attributed to label-entropy collapse: once every retained item has a correct modal answer, the confidence targets lose most of their spread. In an exploratory rescue, they removed the filter and trained on all 2,000 calibration items, producing a binary verbal correctness discriminator that achieved AUROC2 = 0.774 on held-out TriviaQA. This single-pass readout distilled much of the 10-sample self-consistency signal (AUROC2 = 0.999) into one forward pass and exceeded a logit-entropy baseline (AUROC2 = 0.701), while a shuffled-target control showed no improvement.
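A minimal sketch of the target construction described above, assuming 10 sampled answers per item and exact-match grading; the helper names (`self_consistency_target`, `passes_modal_filter`, `build_targets`) and the normalization rule are illustrative assumptions, not the study's implementation:

```python
from collections import Counter

def normalize(ans: str) -> str:
    # Assumed answer-matching rule; the study's grader may differ.
    return ans.strip().lower()

def self_consistency_target(sampled_answers: list[str], gold: str) -> float:
    """Confidence target: fraction of sampled answers matching the gold
    answer -- the 10-sample self-consistency signal being distilled."""
    gold_n = normalize(gold)
    return sum(normalize(a) == gold_n for a in sampled_answers) / len(sampled_answers)

def passes_modal_filter(sampled_answers: list[str], gold: str) -> bool:
    """Modal filter: keep an item only if its modal (most frequent)
    sampled answer is correct."""
    modal, _ = Counter(normalize(a) for a in sampled_answers).most_common(1)[0]
    return modal == normalize(gold)

def build_targets(items, use_modal_filter: bool):
    """items: (question, gold_answer, sampled_answers) triples.
    use_modal_filter=True mirrors the pre-registered Phase 0 setup;
    False mirrors the rescue, which trains on all 2,000 calibration items."""
    out = []
    for question, gold, samples in items:
        if use_modal_filter and not passes_modal_filter(samples, gold):
            continue
        out.append((question, self_consistency_target(samples, gold)))
    return out
```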
Key facts
- Pre-registered Phase 0 protocol on Gemma 3 4B-it
- Modal filter restricted training to items with correct modal answers
- AUROC2 dropped from 0.554 to 0.509
- Exploratory rescue removed the filter and trained on all 2,000 calibration items
- Binary verbal correctness discriminator achieved AUROC2 = 0.774 on held-out TriviaQA
- 10-sample self-consistency signal had AUROC2 = 0.999
- Single-pass readout (AUROC2 = 0.774) exceeded the logit-entropy baseline (AUROC2 = 0.701)
- Shuffled-target control showed no improvement (see the evaluation sketch after this list)
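A hedged sketch of the evaluation and the shuffled-target control, using scikit-learn's `roc_auc_score`; the array names and the permutation-based control setup are assumptions about the protocol rather than the study's code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc(scores: np.ndarray, correct: np.ndarray) -> float:
    """AUROC of a confidence score as a discriminator of answer
    correctness (correct: 1 if the answer was right, else 0)."""
    return roc_auc_score(correct, scores)

# On held-out TriviaQA items (hypothetical arrays over the eval set):
# auroc(verbal_confidence, correct)   # reported rescue result: 0.774
# auroc(-logit_entropy, correct)      # reported baseline: 0.701
#                                     # (negated: lower entropy => higher score)

def shuffled_targets(targets: np.ndarray, seed: int = 0) -> np.ndarray:
    """Shuffled-target control: permute confidence targets across training
    items before fine-tuning, severing the item-target link. Any AUROC2
    gain that survives this control cannot come from the targets."""
    rng = np.random.default_rng(seed)
    return rng.permutation(targets)
```

The control's logic: if fine-tuning on permuted targets also lifted AUROC2, the gain would reflect formatting or distribution shift rather than the self-consistency signal; the study reports no improvement under the shuffle, supporting the signal-driven reading.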