New Framework Measures LLM Reliability for Social Science Annotation Tasks
A new methodological framework called Inter-Prompt Reliability (IPR) has been introduced to assess the stability of large language model outputs when LLMs are used for annotation in computational social science. The framework evaluates how consistently LLMs perform across semantically equivalent but linguistically varied prompts, drawing inspiration from traditional Inter-Rater Reliability measures. The researchers measured IPR using the Pairwise Agreement Rate (PAR) and its distribution, capturing both the overall consistency and the stochastic behavior of model responses. They tested the framework on two distinct annotation tasks: the interpretative TREC task and the knowledge-anchored Politifact task. Results revealed substantial stochastic variation in LLM performance on the interpretative task, while models appeared more stable on knowledge-based annotation work. The study further demonstrated that majority voting across multiple prompts improves reproducibility and reduces variance in LLM outputs. These findings point to concrete methodological precautions, such as checking prompt-level reliability and aggregating over prompt variants, for researchers using LLMs in social science labeling applications.
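To make the metric concrete, below is a minimal Python sketch of a pairwise agreement computation. It assumes PAR for a single item is the fraction of prompt-variant pairs that assign that item the same label; the paper's exact formulation may differ, and the TREC-style example labels are purely illustrative.

```python
from itertools import combinations

def pairwise_agreement_rate(labels):
    """PAR for one item: fraction of prompt-variant pairs that agree on its label."""
    pairs = list(combinations(labels, 2))
    if not pairs:  # fewer than two prompt variants: agreement is trivial
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# labels_by_item[i][j] = label produced by prompt variant j for item i
labels_by_item = [
    ["LOC", "LOC", "LOC"],  # all three prompt variants agree -> PAR = 1.0
    ["LOC", "NUM", "LOC"],  # one dissenting variant          -> PAR = 1/3
]

# The distribution of per-item PAR values (not just its mean) is what
# captures stochastic variation across prompt phrasings.
par_values = [pairwise_agreement_rate(row) for row in labels_by_item]
print(par_values)  # [1.0, 0.333...]
```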
Key facts
- Inter-Prompt Reliability (IPR) framework evaluates LLM stability across prompt variations
- Framework draws on Inter-Rater Reliability concepts from traditional research methods
- Measured using Pairwise Agreement Rate (PAR) and its distribution
- Tested on TREC (interpretative) and Politifact (knowledge-anchored) annotation tasks
- LLMs show substantial stochastic variation in interpretative tasks
- LLMs appear more stable in knowledge-based annotation tasks
- Majority voting across prompts improves reproducibility and reduces variance (see the sketch after this list)
- Addresses methodological reliability concerns in computational social science
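As one illustration of the aggregation step, here is a minimal sketch of majority voting over prompt variants. The tie-breaking behavior and the Politifact-style labels are assumptions for illustration, not the paper's specification.

```python
from collections import Counter

def majority_vote(labels):
    """Most frequent label across prompt variants for a single item.

    Counter.most_common breaks ties by first-encountered order; a real
    pipeline might instead break ties randomly or flag the item for review.
    """
    return Counter(labels).most_common(1)[0][0]

# Five prompt phrasings of the same annotation question; the aggregated
# label is more reproducible than any single phrasing's output.
votes = ["false", "half-true", "false", "false", "half-true"]
print(majority_vote(votes))  # "false"
```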