LLM Prompt Strategies for Qualitative Coding in Software Engineering
This empirical study evaluates three large language models—Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash—on their ability to qualitatively code psychological safety in software engineering communities. It contrasts zero-shot and multi-shot prompt engineering strategies, using Cohen's kappa as the agreement metric across ten independent runs per configuration. Multi-shot prompting notably improves agreement for Claude Haiku (Delta kappa = +0.034). The findings suggest that LLMs can support qualitative analysis, while underscoring their sensitivity to prompt design and the need for reproducibility when approximating human coding judgment.
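The zero-shot versus multi-shot contrast can be sketched as prompt-construction logic. This is an illustrative sketch only: the codebook wording, label names, and exemplar format are hypothetical, not the study's actual prompts.

```python
# Hypothetical sketch of the two prompt strategies compared in the study.
# The codebook text and 'safe'/'unsafe' labels are illustrative assumptions.
CODEBOOK = "Label the message as 'safe' or 'unsafe' for psychological safety."

def zero_shot(message: str) -> str:
    # Zero-shot: the codebook instruction plus the item to code, no examples.
    return f"{CODEBOOK}\n\nMessage: {message}\nLabel:"

def multi_shot(message: str, examples: list[tuple[str, str]]) -> str:
    # Multi-shot: labeled exemplars are prepended before the item to code,
    # giving the model in-context demonstrations of the coding scheme.
    shots = "\n".join(f"Message: {m}\nLabel: {l}" for m, l in examples)
    return f"{CODEBOOK}\n\n{shots}\n\nMessage: {message}\nLabel:"

demo = [("Great question, let's dig in.", "safe"),
        ("Why would you even ask that?", "unsafe")]
print(multi_shot("I'm not sure I understand the build step.", demo))
```

Either prompt would then be sent to each model, and the returned label compared against the human-assigned code.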
Key facts
- Study evaluates three LLMs: Claude Haiku, DeepSeek-Chat, Gemini 2.5 Flash
- Compares zero-shot and multi-shot prompt engineering strategies
- Uses Cohen's kappa as primary agreement metric
- Ten independent runs per configuration
- Multi-shot prompting improves agreement for Claude Haiku (Delta kappa = +0.034)
- Focuses on qualitative coding of psychological safety in software engineering communities
- Published on arXiv with ID 2605.07422
- Highlights sensitivity of LLMs to prompt design
Entities
Institutions
- arXiv