Self-Consistency Distillation Fails on Gemma 3 4B Verbal Confidence
A pre-registered study investigated whether confidence-conditioned supervised fine-tuning (CSFT) with self-consistency targets can improve verbal confidence calibration in smaller instruct-tuned language models. The authors tested this on Gemma 3 4B-it, applying a modal filter that restricted training to items whose modal (most frequent) sampled answer was correct. Under this filter, AUROC2 declined from 0.554 to 0.509, which the authors attributed to label-entropy collapse: once every retained item has a correct modal answer, the confidence targets lose most of their spread. In an exploratory rescue, they removed the filter and trained on all 2,000 calibration items, producing a binary verbal correctness discriminator that achieved AUROC2 = 0.774 on held-out TriviaQA. This single-pass readout distilled much of the 10-sample self-consistency signal (AUROC2 = 0.999) into one forward pass and exceeded a logit-entropy baseline (AUROC2 = 0.701), while a shuffled-target control showed no improvement.
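A minimal sketch of the target construction described above, assuming 10 sampled answers per item and exact-match grading; the helper names (`self_consistency_target`, `passes_modal_filter`, `build_targets`) and the normalization rule are illustrative assumptions, not the study's implementation:

```python
from collections import Counter

def normalize(ans: str) -> str:
    # Assumed answer-matching rule; the study's grader may differ.
    return ans.strip().lower()

def self_consistency_target(sampled_answers: list[str], gold: str) -> float:
    """Confidence target: fraction of sampled answers matching the gold
    answer -- the 10-sample self-consistency signal being distilled."""
    gold_n = normalize(gold)
    return sum(normalize(a) == gold_n for a in sampled_answers) / len(sampled_answers)

def passes_modal_filter(sampled_answers: list[str], gold: str) -> bool:
    """Modal filter: keep an item only if its modal (most frequent)
    sampled answer is correct."""
    modal, _ = Counter(normalize(a) for a in sampled_answers).most_common(1)[0]
    return modal == normalize(gold)

def build_targets(items, use_modal_filter: bool):
    """items: (question, gold_answer, sampled_answers) triples.
    use_modal_filter=True mirrors the pre-registered Phase 0 setup;
    False mirrors the rescue, which trains on all 2,000 calibration items."""
    out = []
    for question, gold, samples in items:
        if use_modal_filter and not passes_modal_filter(samples, gold):
            continue
        out.append((question, self_consistency_target(samples, gold)))
    return out
```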
Key facts
- Pre-registered Phase 0 protocol on Gemma 3 4B-it
- Modal filter restricted training to items with correct modal answers
- AUROC2 dropped from 0.554 to 0.509
- Exploratory rescue removed the filter and trained on all 2,000 calibration items
- Binary verbal correctness discriminator achieved AUROC2 = 0.774 on held-out TriviaQA
- 10-sample self-consistency signal had AUROC2 = 0.999
- Single-pass readout (AUROC2 = 0.774) exceeded the logit-entropy baseline (AUROC2 = 0.701)
- Shuffled-target control showed no improvement (see the evaluation sketch after this list)
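A hedged sketch of the evaluation and the shuffled-target control, using scikit-learn's `roc_auc_score`; the array names and the permutation-based control setup are assumptions about the protocol rather than the study's code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc(scores: np.ndarray, correct: np.ndarray) -> float:
    """AUROC of a confidence score as a discriminator of answer
    correctness (correct: 1 if the answer was right, else 0)."""
    return roc_auc_score(correct, scores)

# On held-out TriviaQA items (hypothetical arrays over the eval set):
# auroc(verbal_confidence, correct)   # reported rescue result: 0.774
# auroc(-logit_entropy, correct)      # reported baseline: 0.701
#                                     # (negated: lower entropy => higher score)

def shuffled_targets(targets: np.ndarray, seed: int = 0) -> np.ndarray:
    """Shuffled-target control: permute confidence targets across training
    items before fine-tuning, severing the item-target link. Any AUROC2
    gain that survives this control cannot come from the targets."""
    rng = np.random.default_rng(seed)
    return rng.permutation(targets)
```

The control's logic: if fine-tuning on permuted targets also lifted AUROC2, the gain would reflect formatting or distribution shift rather than the self-consistency signal; the study reports no improvement under the shuffle, supporting the signal-driven reading.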