LLM Miscalibration in Social Science Measurement
A new paper on arXiv (2605.11954) investigates miscalibration in large language models used for social science measurement. The study audits how well confidence scores from models such as GPT-5-mini and DeepSeek-V3.2 align with actual correctness across 14 social science constructs, and uses a case study on the Federal Open Market Committee (FOMC) to show that filtering predictions by confidence can alter downstream regression estimates. As a mitigation strategy, the authors propose soft label distillation.
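The summary does not specify the paper's exact calibration metric, but a common way to audit a confidence-correctness gap of this kind is expected calibration error (ECE), with correctness defined by a tolerance band around the gold score. Below is a minimal sketch along those lines; the function name, the 10-bin scheme, the 1-5 construct scale, and all data are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def tolerance_ece(confidence, prediction, gold, tol=0.5, n_bins=10):
    """Expected calibration error, with a prediction counted as correct
    when it falls within `tol` of the gold score (tolerance-based correctness)."""
    confidence = np.asarray(confidence, dtype=float)
    correct = (np.abs(np.asarray(prediction, dtype=float)
                      - np.asarray(gold, dtype=float)) <= tol).astype(float)
    # Equal-width confidence bins on [0, 1]; confidence 1.0 lands in the top bin.
    bins = np.clip((confidence * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # |mean confidence - empirical accuracy| within the bin, weighted
            # by the bin's share of all predictions.
            ece += mask.mean() * abs(confidence[mask].mean() - correct[mask].mean())
    return ece

# Toy check with made-up numbers: uniformly overconfident scores give a large ECE.
rng = np.random.default_rng(0)
conf = rng.uniform(0.8, 1.0, size=200)   # model always reports high confidence
pred = rng.uniform(1.0, 5.0, size=200)   # construct scores on a 1-5 scale
gold = rng.uniform(1.0, 5.0, size=200)
print(f"tolerance-based ECE: {tolerance_ece(conf, pred, gold):.3f}")
```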
Key facts
- arXiv paper 2605.11954 studies miscalibration in LLM-based social science measurement.
- Case study on FOMC shows confidence filtering changes regression estimates (see the filtering sketch after this list).
- Audits calibration across 14 social science constructs.
- Models include GPT-5-mini and DeepSeek-V3.2.
- Reported confidence poorly aligned with tolerance-based correctness.
- Proposes a soft label distillation pipeline as mitigation (sketched after this list).
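To see how confidence-based filtering can move regression estimates, here is one plausible mechanism sketched with synthetic data: an LLM measurement of a latent construct carries more noise at low confidence, so classical measurement error attenuates the estimated slope, and filtering on confidence changes how much. All variables and thresholds below are made up for illustration; the paper's actual FOMC analysis may work differently, and if confidence is miscalibrated the filter selects the wrong rows.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Purely synthetic setup (not the paper's data): a latent construct x drives
# an outcome y, but the analyst only observes a noisy LLM measurement x_hat.
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=1.0, size=n)

# Low-confidence measurements are noisier, so unfiltered estimates suffer
# more attenuation from classical measurement error.
confidence = rng.uniform(0.0, 1.0, size=n)
noise_scale = 2.0 * (1.0 - confidence)   # conf 1.0 -> exact, conf 0.0 -> very noisy
x_hat = x + rng.normal(scale=noise_scale, size=n)

def ols_slope(reg, out):
    """Slope from OLS of `out` on `reg` with an intercept."""
    X = np.column_stack([np.ones_like(reg), reg])
    beta, *_ = np.linalg.lstsq(X, out, rcond=None)
    return beta[1]

keep = confidence >= 0.8                 # confidence-based filtering
print("true slope:                  0.500")
print(f"slope, all measurements:     {ols_slope(x_hat, y):.3f}")
print(f"slope, high-confidence only: {ols_slope(x_hat[keep], y[keep]):.3f}")
```

Running this, the unfiltered slope is heavily attenuated toward zero while the high-confidence subset recovers something close to the true coefficient, so the choice of confidence threshold directly shifts the reported estimate.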
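The summary likewise does not detail the proposed soft label distillation pipeline, but the usual idea behind the term is to train a student on the teacher's full probability distribution rather than on hard argmax labels, which tends to produce better-calibrated confidence. A minimal PyTorch sketch under that assumption follows; the shapes, the linear student, and the random stand-in data are all placeholders, not the authors' setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical shapes: a tiny classifier distilled onto teacher soft labels
# over 5 construct categories. In practice `teacher_probs` would come from
# the LLM's reported distribution over labels, not random logits.
n_features, n_classes = 32, 5
student = nn.Linear(n_features, n_classes)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

features = torch.randn(256, n_features)            # stand-in inputs
teacher_logits = torch.randn(256, n_classes)       # stand-in teacher scores
teacher_probs = F.softmax(teacher_logits, dim=-1)  # soft labels

for _ in range(100):
    log_probs = F.log_softmax(student(features), dim=-1)
    # KL(teacher || student): the student matches the teacher's whole
    # distribution instead of a single hard label, preserving the
    # teacher's uncertainty in the student's confidence scores.
    loss = F.kl_div(log_probs, teacher_probs, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```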
Entities
Institutions
- arXiv