LLM Variability in Evidence Screening for Software Engineering SLRs
A study published on arXiv investigates the performance and variability of Large Language Models (LLMs) when screening studies for systematic literature reviews (SLRs) in software engineering. The research compares 12 LLMs from four providers (OpenAI, Google's Gemini, Anthropic, and Meta's Llama families) against 4 classical classifiers (Logistic Regression, Support Vector Classification, Random Forest, Naive Bayes) using 518 papers from 2 real SLRs. The study examines three dimensions: variability in LLM performance, the impact of which input metadata is supplied (abstract, title, keywords), and how LLMs compare with classical models. False negatives are identified as a key risk, since a relevant study wrongly excluded at screening can compromise the validity of the entire review. The findings aim to provide evidence on LLM behavior during study screening, an area with limited prior research despite rapid LLM adoption.
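The metadata dimension above can be illustrated with a small sketch: enumerating the possible combinations of title, abstract, and keywords and assembling a screening prompt for each. This is a hypothetical illustration, not the study's actual prompts; the paper record, field names, and inclusion criterion are invented for the example.

```python
from itertools import combinations

# Hypothetical paper record; field names and values are illustrative only.
paper = {
    "title": "Automated test generation with LLMs",
    "abstract": "We evaluate large language models for unit test generation.",
    "keywords": "testing, LLM, software engineering",
}

def build_prompt(paper, fields, criteria):
    """Assemble a screening prompt from the chosen metadata fields."""
    meta = "\n".join(f"{f.capitalize()}: {paper[f]}" for f in fields)
    return (
        "Decide whether the paper below meets the inclusion criteria "
        f"({criteria}). Answer INCLUDE or EXCLUDE.\n\n{meta}"
    )

# Enumerate every non-empty combination of the three metadata fields
# (7 conditions: 3 single fields, 3 pairs, 1 triple).
conditions = [
    c for r in (1, 2, 3)
    for c in combinations(("title", "abstract", "keywords"), r)
]
prompts = {c: build_prompt(paper, c, "uses LLMs for SE tasks") for c in conditions}
```

Varying the prompt over these seven conditions, while holding the model and criteria fixed, is one way to isolate how much each metadata field contributes to screening decisions.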
Key facts
- Study compares 12 LLMs from 4 providers: OpenAI, Google (Gemini), Anthropic, Meta (Llama)
- Classical classifiers include Logistic Regression, Support Vector Classification, Random Forest, Naive Bayes
- Dataset consists of 518 papers from 2 real Systematic Literature Reviews
- Focuses on study screening phase in software engineering SLRs
- Examines impact of input metadata: abstract, title, keywords
- False negatives (relevant studies wrongly excluded) are risk-asymmetric and can compromise review validity
- Limited evidence exists on LLM behavior during screening
- Published on arXiv with ID 2604.27006
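The false-negative point above can be made concrete with a minimal metric sketch: in screening, recall over the relevant papers matters more than overall accuracy, because each false negative is a relevant study silently dropped from the review. The function and toy labels below are illustrative, not the study's evaluation code.

```python
def screening_metrics(gold, pred):
    """Confusion counts for include/exclude screening decisions.

    gold, pred: dicts mapping paper id -> True (include) / False (exclude).
    """
    tp = sum(1 for p in gold if gold[p] and pred[p])        # correctly included
    fn = sum(1 for p in gold if gold[p] and not pred[p])    # relevant but excluded
    fp = sum(1 for p in gold if not gold[p] and pred[p])    # irrelevant but included
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"recall": recall, "precision": precision, "false_negatives": fn}

# Toy example: one relevant paper (p2) is wrongly excluded -- a false negative.
gold = {"p1": True, "p2": True, "p3": False, "p4": False}
pred = {"p1": True, "p2": False, "p3": False, "p4": True}
metrics = screening_metrics(gold, pred)  # recall 0.5, one false negative
```

A false positive (p4) only costs a reviewer time at the next stage, while the false negative (p2) removes evidence from the review entirely, which is why the risk is asymmetric.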
Entities
Organizations and model families
- OpenAI
- Google (developer of Gemini)
- Anthropic
- Meta (developer of Llama)
- arXiv