LLM Miscalibration in Social Science Measurement
A new paper on arXiv (2605.11954) investigates miscalibration in large language models used for social science measurement. The study audits how well confidence scores from models such as GPT-5-mini and DeepSeek-V3.2 align with actual correctness across 14 social science constructs, and uses a case study on the Federal Open Market Committee (FOMC) to show that filtering predictions by confidence can alter downstream regression estimates. As a mitigation strategy, the authors propose soft label distillation.
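The summary does not specify the paper's exact calibration metric, but a common way to audit a confidence-correctness gap of this kind is expected calibration error (ECE), with correctness defined by a tolerance band around the gold score. Below is a minimal sketch along those lines; the function name, the 10-bin scheme, the 1-5 construct scale, and all data are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def tolerance_ece(confidence, prediction, gold, tol=0.5, n_bins=10):
    """Expected calibration error, with a prediction counted as correct
    when it falls within `tol` of the gold score (tolerance-based correctness)."""
    confidence = np.asarray(confidence, dtype=float)
    correct = (np.abs(np.asarray(prediction, dtype=float)
                      - np.asarray(gold, dtype=float)) <= tol).astype(float)
    # Equal-width confidence bins on [0, 1]; confidence 1.0 lands in the top bin.
    bins = np.clip((confidence * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # |mean confidence - empirical accuracy| within the bin, weighted
            # by the bin's share of all predictions.
            ece += mask.mean() * abs(confidence[mask].mean() - correct[mask].mean())
    return ece

# Toy check with made-up numbers: uniformly overconfident scores give a large ECE.
rng = np.random.default_rng(0)
conf = rng.uniform(0.8, 1.0, size=200)   # model always reports high confidence
pred = rng.uniform(1.0, 5.0, size=200)   # construct scores on a 1-5 scale
gold = rng.uniform(1.0, 5.0, size=200)
print(f"tolerance-based ECE: {tolerance_ece(conf, pred, gold):.3f}")
```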
Key facts
- arXiv paper 2605.11954 studies miscalibration in LLM-based social science measurement.
- Case study on FOMC shows confidence filtering changes regression estimates (see the filtering sketch after this list).
- Audits calibration across 14 social science constructs.
- Models include GPT-5-mini and DeepSeek-V3.2.
- Reported confidence poorly aligned with tolerance-based correctness.
- Proposes a soft label distillation pipeline as mitigation (sketched after this list).
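To see how confidence-based filtering can move regression estimates, here is one plausible mechanism sketched with synthetic data: an LLM measurement of a latent construct carries more noise at low confidence, so classical measurement error attenuates the estimated slope, and filtering on confidence changes how much. All variables and thresholds below are made up for illustration; the paper's actual FOMC analysis may work differently, and if confidence is miscalibrated the filter selects the wrong rows.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Purely synthetic setup (not the paper's data): a latent construct x drives
# an outcome y, but the analyst only observes a noisy LLM measurement x_hat.
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=1.0, size=n)

# Low-confidence measurements are noisier, so unfiltered estimates suffer
# more attenuation from classical measurement error.
confidence = rng.uniform(0.0, 1.0, size=n)
noise_scale = 2.0 * (1.0 - confidence)   # conf 1.0 -> exact, conf 0.0 -> very noisy
x_hat = x + rng.normal(scale=noise_scale, size=n)

def ols_slope(reg, out):
    """Slope from OLS of `out` on `reg` with an intercept."""
    X = np.column_stack([np.ones_like(reg), reg])
    beta, *_ = np.linalg.lstsq(X, out, rcond=None)
    return beta[1]

keep = confidence >= 0.8                 # confidence-based filtering
print("true slope:                  0.500")
print(f"slope, all measurements:     {ols_slope(x_hat, y):.3f}")
print(f"slope, high-confidence only: {ols_slope(x_hat[keep], y[keep]):.3f}")
```

Running this, the unfiltered slope is heavily attenuated toward zero while the high-confidence subset recovers something close to the true coefficient, so the choice of confidence threshold directly shifts the reported estimate.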
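The summary likewise does not detail the proposed soft label distillation pipeline, but the usual idea behind the term is to train a student on the teacher's full probability distribution rather than on hard argmax labels, which tends to produce better-calibrated confidence. A minimal PyTorch sketch under that assumption follows; the shapes, the linear student, and the random stand-in data are all placeholders, not the authors' setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical shapes: a tiny classifier distilled onto teacher soft labels
# over 5 construct categories. In practice `teacher_probs` would come from
# the LLM's reported distribution over labels, not random logits.
n_features, n_classes = 32, 5
student = nn.Linear(n_features, n_classes)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

features = torch.randn(256, n_features)            # stand-in inputs
teacher_logits = torch.randn(256, n_classes)       # stand-in teacher scores
teacher_probs = F.softmax(teacher_logits, dim=-1)  # soft labels

for _ in range(100):
    log_probs = F.log_softmax(student(features), dim=-1)
    # KL(teacher || student): the student matches the teacher's whole
    # distribution instead of a single hard label, preserving the
    # teacher's uncertainty in the student's confidence scores.
    loss = F.kl_div(log_probs, teacher_probs, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```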
Entities
Institutions
- arXiv