New Framework Measures LLM Reliability for Social Science Annotation Tasks
A new methodological framework called Inter-Prompt Reliability (IPR) has been introduced to assess the stability of large language model outputs when LLMs are used for annotation in computational social science. The framework evaluates how consistently LLMs perform across semantically equivalent but linguistically varied prompts, drawing inspiration from traditional Inter-Rater Reliability measures. The researchers measured IPR using the Pairwise Agreement Rate (PAR) and its distribution, capturing both the overall consistency and the stochastic behavior of model responses. They tested the framework on two distinct annotation tasks: the interpretative TREC task and the knowledge-anchored Politifact task. Results revealed substantial stochastic variation in LLM performance on the interpretative task, while models appeared more stable on knowledge-based annotation work. The study further demonstrated that majority voting across multiple prompts improves reproducibility and reduces variance in LLM outputs. These findings point to concrete methodological precautions, such as checking prompt-level reliability and aggregating over prompt variants, for researchers using LLMs in social science labeling applications.
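To make the metric concrete, below is a minimal Python sketch of a pairwise agreement computation. It assumes PAR for a single item is the fraction of prompt-variant pairs that assign that item the same label; the paper's exact formulation may differ, and the TREC-style example labels are purely illustrative.

```python
from itertools import combinations

def pairwise_agreement_rate(labels):
    """PAR for one item: fraction of prompt-variant pairs that agree on its label."""
    pairs = list(combinations(labels, 2))
    if not pairs:  # fewer than two prompt variants: agreement is trivial
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# labels_by_item[i][j] = label produced by prompt variant j for item i
labels_by_item = [
    ["LOC", "LOC", "LOC"],  # all three prompt variants agree -> PAR = 1.0
    ["LOC", "NUM", "LOC"],  # one dissenting variant          -> PAR = 1/3
]

# The distribution of per-item PAR values (not just its mean) is what
# captures stochastic variation across prompt phrasings.
par_values = [pairwise_agreement_rate(row) for row in labels_by_item]
print(par_values)  # [1.0, 0.333...]
```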
Key facts
- Inter-Prompt Reliability (IPR) framework evaluates LLM stability across prompt variations
- Framework draws on Inter-Rater Reliability concepts from traditional research methods
- Measured using Pairwise Agreement Rate (PAR) and its distribution
- Tested on TREC (interpretative) and Politifact (knowledge-anchored) annotation tasks
- LLMs show substantial stochastic variation in interpretative tasks
- LLMs appear more stable in knowledge-based annotation tasks
- Majority voting across prompts improves reproducibility and reduces variance (see the sketch after this list)
- Addresses methodological reliability concerns in computational social science
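As one illustration of the aggregation step, here is a minimal sketch of majority voting over prompt variants. The tie-breaking behavior and the Politifact-style labels are assumptions for illustration, not the paper's specification.

```python
from collections import Counter

def majority_vote(labels):
    """Most frequent label across prompt variants for a single item.

    Counter.most_common breaks ties by first-encountered order; a real
    pipeline might instead break ties randomly or flag the item for review.
    """
    return Counter(labels).most_common(1)[0][0]

# Five prompt phrasings of the same annotation question; the aggregated
# label is more reproducible than any single phrasing's output.
votes = ["false", "half-true", "false", "false", "half-true"]
print(majority_vote(votes))  # "false"
```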