LLM Variability in Evidence Screening for Software Engineering SLRs
A study published on arXiv investigates the performance and variability of Large Language Models (LLMs) when screening studies for systematic literature reviews (SLRs) in software engineering. The research compares 12 LLMs from four providers (OpenAI, Google's Gemini, Anthropic, and Meta's Llama families) against 4 classical classifiers (Logistic Regression, Support Vector Classification, Random Forest, Naive Bayes) using 518 papers from 2 real SLRs. The study examines three dimensions: variability in LLM performance, the impact of which input metadata is supplied (abstract, title, keywords), and how LLMs compare with classical models. False negatives are identified as a key risk, since a relevant study wrongly excluded at screening can compromise the validity of the entire review. The findings aim to provide evidence on LLM behavior during study screening, an area with limited prior research despite rapid LLM adoption.
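The metadata dimension above can be illustrated with a small sketch: enumerating the possible combinations of title, abstract, and keywords and assembling a screening prompt for each. This is a hypothetical illustration, not the study's actual prompts; the paper record, field names, and inclusion criterion are invented for the example.

```python
from itertools import combinations

# Hypothetical paper record; field names and values are illustrative only.
paper = {
    "title": "Automated test generation with LLMs",
    "abstract": "We evaluate large language models for unit test generation.",
    "keywords": "testing, LLM, software engineering",
}

def build_prompt(paper, fields, criteria):
    """Assemble a screening prompt from the chosen metadata fields."""
    meta = "\n".join(f"{f.capitalize()}: {paper[f]}" for f in fields)
    return (
        "Decide whether the paper below meets the inclusion criteria "
        f"({criteria}). Answer INCLUDE or EXCLUDE.\n\n{meta}"
    )

# Enumerate every non-empty combination of the three metadata fields
# (7 conditions: 3 single fields, 3 pairs, 1 triple).
conditions = [
    c for r in (1, 2, 3)
    for c in combinations(("title", "abstract", "keywords"), r)
]
prompts = {c: build_prompt(paper, c, "uses LLMs for SE tasks") for c in conditions}
```

Varying the prompt over these seven conditions, while holding the model and criteria fixed, is one way to isolate how much each metadata field contributes to screening decisions.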
Key facts
- Study compares 12 LLMs from 4 providers: OpenAI, Google (Gemini), Anthropic, Meta (Llama)
- Classical classifiers include Logistic Regression, Support Vector Classification, Random Forest, Naive Bayes
- Dataset consists of 518 papers from 2 real Systematic Literature Reviews
- Focuses on study screening phase in software engineering SLRs
- Examines impact of input metadata: abstract, title, keywords
- False negatives (relevant studies wrongly excluded) are risk-asymmetric and can compromise review validity
- Limited evidence exists on LLM behavior during screening
- Published on arXiv with ID 2604.27006
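The false-negative point above can be made concrete with a minimal metric sketch: in screening, recall over the relevant papers matters more than overall accuracy, because each false negative is a relevant study silently dropped from the review. The function and toy labels below are illustrative, not the study's evaluation code.

```python
def screening_metrics(gold, pred):
    """Confusion counts for include/exclude screening decisions.

    gold, pred: dicts mapping paper id -> True (include) / False (exclude).
    """
    tp = sum(1 for p in gold if gold[p] and pred[p])        # correctly included
    fn = sum(1 for p in gold if gold[p] and not pred[p])    # relevant but excluded
    fp = sum(1 for p in gold if not gold[p] and pred[p])    # irrelevant but included
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"recall": recall, "precision": precision, "false_negatives": fn}

# Toy example: one relevant paper (p2) is wrongly excluded -- a false negative.
gold = {"p1": True, "p2": True, "p3": False, "p4": False}
pred = {"p1": True, "p2": False, "p3": False, "p4": True}
metrics = screening_metrics(gold, pred)  # recall 0.5, one false negative
```

A false positive (p4) only costs a reviewer time at the next stage, while the false negative (p2) removes evidence from the review entirely, which is why the risk is asymmetric.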
Entities
Organizations and model families
- OpenAI
- Google (developer of Gemini)
- Anthropic
- Meta (developer of Llama)
- arXiv