LLM Prompt Strategies for Qualitative Coding in Software Engineering
This empirical study evaluates three large language models—Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash—on their ability to qualitatively code psychological safety in software engineering communities. It contrasts zero-shot and multi-shot prompt engineering strategies, using Cohen's kappa as the agreement metric across ten independent runs per configuration. Multi-shot prompting notably improves agreement for Claude Haiku (Delta kappa = +0.034). The findings suggest that LLMs can support qualitative analysis, while underscoring their sensitivity to prompt design and the need for reproducibility when approximating human coding judgment.
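The zero-shot versus multi-shot contrast can be sketched as prompt-construction logic. This is an illustrative sketch only: the codebook wording, label names, and exemplar format are hypothetical, not the study's actual prompts.

```python
# Hypothetical sketch of the two prompt strategies compared in the study.
# The codebook text and 'safe'/'unsafe' labels are illustrative assumptions.
CODEBOOK = "Label the message as 'safe' or 'unsafe' for psychological safety."

def zero_shot(message: str) -> str:
    # Zero-shot: the codebook instruction plus the item to code, no examples.
    return f"{CODEBOOK}\n\nMessage: {message}\nLabel:"

def multi_shot(message: str, examples: list[tuple[str, str]]) -> str:
    # Multi-shot: labeled exemplars are prepended before the item to code,
    # giving the model in-context demonstrations of the coding scheme.
    shots = "\n".join(f"Message: {m}\nLabel: {l}" for m, l in examples)
    return f"{CODEBOOK}\n\n{shots}\n\nMessage: {message}\nLabel:"

demo = [("Great question, let's dig in.", "safe"),
        ("Why would you even ask that?", "unsafe")]
print(multi_shot("I'm not sure I understand the build step.", demo))
```

Either prompt would then be sent to each model, and the returned label compared against the human-assigned code.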
Key facts
- Study evaluates three LLMs: Claude Haiku, DeepSeek-Chat, Gemini 2.5 Flash
- Compares zero-shot and multi-shot prompt engineering strategies
- Uses Cohen's kappa as primary agreement metric
- Ten independent runs per configuration
- Multi-shot prompting improves agreement for Claude Haiku (Delta kappa = +0.034)
- Focuses on qualitative coding of psychological safety in software engineering communities
- Published on arXiv with ID 2605.07422
- Highlights sensitivity of LLMs to prompt design
Entities
Institutions
- arXiv