ARTFEED — Contemporary Art Intelligence

Self-Consistency Distillation Fails on Gemma 3 4B Verbal Confidence

ai-technology · 2026-04-29

A pre-registered study asked whether confidence-conditioned supervised fine-tuning (CSFT) with self-consistency targets can improve verbal confidence calibration in smaller instruct-tuned language models. The authors tested Gemma 3 4B-it, applying a modal filter that restricted training to items whose modal (majority-vote) response was correct. The filter backfired: AUROC2 fell from 0.554 to 0.509, a decline the authors attribute to label-entropy collapse, since keeping only correct-modal items leaves few low-confidence targets to learn from. In an exploratory follow-up, they removed the filter and trained on all 2,000 calibration items, yielding a binary verbal correctness discriminator that reached AUROC2 = 0.774 on held-out TriviaQA. The approach distills a 10-sample self-consistency signal (AUROC2 = 0.999) into a cheaper single-pass readout, and a shuffled-target control showed no improvement, ruling out incidental gains from fine-tuning alone.
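Self-consistency here means sampling several answers per question and treating the modal (majority-vote) answer, with its agreement rate, as a correctness signal; the modal filter then keeps only items whose modal answer matches the gold label. A minimal sketch of both ideas (function and variable names are illustrative, not from the study):

```python
from collections import Counter

def self_consistency(samples):
    """Return the modal (majority-vote) answer and the fraction of
    samples that agree with it, used as a confidence score."""
    modal_answer, modal_count = Counter(samples).most_common(1)[0]
    return modal_answer, modal_count / len(samples)

def modal_filter(items):
    """Keep only (gold, samples) items whose modal answer matches the
    gold answer -- the filter the study later removed."""
    return [(gold, samples) for gold, samples in items
            if self_consistency(samples)[0] == gold]

# 10 sampled answers for one hypothetical TriviaQA question
samples = ["Paris"] * 8 + ["Lyon", "Marseille"]
answer, conf = self_consistency(samples)
print(answer, conf)  # Paris 0.8
```

Under this reading, an unfiltered training set retains items with low agreement rates, which is what preserves label entropy.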

Key facts

  • Pre-registered Phase 0 protocol on Gemma 3 4B-it
  • Modal filter restricted training to items with correct modal answers
  • AUROC2 dropped from 0.554 to 0.509
  • Exploratory rescue removed the filter and trained on all 2,000 calibration items
  • Binary verbal correctness discriminator achieved AUROC2 = 0.774 on held-out TriviaQA
  • 10-sample self-consistency signal had AUROC2 = 0.999
  • Single-pass readout exceeded the logit-entropy baseline (AUROC2 = 0.701)
  • Shuffled-target control showed no improvement
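
The AUROC figures above measure how well a confidence score separates correct from incorrect answers. Assuming AUROC2 denotes a standard AUROC over binary correctness labels, a minimal pure-Python sketch of the metric, plus the idea behind the shuffled-target control:

```python
import random

def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen correct item (label 1) scores higher than a
    randomly chosen incorrect one (label 0); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]               # item correctness
scores = [0.9, 0.6, 0.7, 0.2]       # discriminator confidence per item
print(auroc(labels, scores))        # 0.75

# Shuffled-target control: permuting the labels destroys any real
# label-score association, so AUROC should fall toward chance (0.5),
# confirming the discriminator's signal is not a training artifact.
random.shuffle(labels)
```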

Entities

Institutions

  • arXiv
