ARTFEED — Contemporary Art Intelligence

Study Reveals Bias and Unreliability in LLM-as-a-Judge Systems for Software Engineering

ai-technology · 2026-04-22

A new research paper examines the use of Large Language Models (LLMs) as evaluators of code artifacts in software engineering workflows. The study, published on arXiv under identifier 2604.16790v1, investigates LLM-judge systems that rank candidate solutions and guide patch selection when human review or test coverage is insufficient. The researchers analyze two pointwise judging regimes across multiple tasks, including code generation, code repair, and test generation.

The paper systematically probes prompt-induced biases, finding that repeated evaluations of identical cases often produce conflicting results, that small modifications to prompts can dramatically alter outcomes, and that seemingly semantics-preserving perturbations elicit divergent verdicts. This measurement-first approach highlights the lack of a principled reliability framework in current practice, despite the attractive scalability of LLM-based evaluation. The study accounts for difficulty levels across repeated runs and employs controlled prompt interventions to isolate the effect of specific presentation cues. As LLMs become increasingly integrated into agentic software engineering workflows, understanding these biases is crucial for developing more reliable evaluation systems.
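
The run-to-run disagreement the paper measures can be illustrated with a minimal sketch. Everything here is hypothetical: `mock_judge` is a deterministic stand-in for an LLM judge call (the paper does not publish this code), and the stability metric is simply the fraction of repeated runs agreeing with the majority verdict.

```python
import random
from collections import Counter

def mock_judge(case_id: str, seed: int) -> str:
    """Hypothetical stand-in for one LLM judge call. Verdicts vary
    across seeds to mimic the nondeterminism the study measures."""
    rng = random.Random(f"{case_id}-{seed}")  # deterministic per (case, seed)
    return "pass" if rng.random() < 0.8 else "fail"

def verdict_stability(case_ids, n_runs=10):
    """Judge each case n_runs times and report the fraction of runs
    agreeing with the majority verdict (1.0 = perfectly stable)."""
    stability = {}
    for cid in case_ids:
        verdicts = [mock_judge(cid, seed) for seed in range(n_runs)]
        majority_count = Counter(verdicts).most_common(1)[0][1]
        stability[cid] = majority_count / n_runs
    return stability

scores = verdict_stability(["case-1", "case-2", "case-3"])
unstable = [cid for cid, s in scores.items() if s < 1.0]
```

With a real judge, any case scoring below 1.0 is one whose verdict flips across identical repeated evaluations, which is exactly the unreliability the study quantifies.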

Key facts

  • Large Language Models are increasingly used as judges to evaluate code artifacts
  • LLM judges help rank candidate solutions and guide patch selection in software engineering workflows
  • Current practice lacks a principled account of reliability and bias
  • Repeated evaluations of the same case can disagree
  • Small prompt edits can swing outcomes
  • Seemingly semantics-preserving perturbations may elicit divergent verdicts
  • The paper studies LLM-as-a-Judge for code through a measurement-first lens
  • It analyzes two pointwise judging regimes across code generation, code repair, and test generation
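
A pointwise regime, as opposed to a pairwise one, scores each candidate in isolation and ranks by the standalone scores. The sketch below is an assumed illustration of that setup, not the paper's implementation: the prompt template, `score_candidate` stub, and its length-based heuristic are all hypothetical placeholders for a real LLM call.

```python
# Hypothetical pointwise-judge prompt; a real system would send this to an LLM.
POINTWISE_PROMPT = (
    "You are reviewing a candidate patch for the issue below.\n"
    "Issue: {issue}\nPatch: {patch}\n"
    "Rate correctness from 1 (wrong) to 5 (correct). Reply with the number only."
)

def score_candidate(issue: str, patch: str) -> int:
    """Placeholder for a single LLM call returning a 1-5 score.
    A trivial length-based heuristic stands in for the model here."""
    return min(5, 1 + len(patch) % 5)

def rank_candidates(issue: str, patches: list[str]) -> list[str]:
    """Pointwise regime: each patch is judged independently (no pairwise
    comparison), then patches are sorted by their standalone scores."""
    scored = [(score_candidate(issue, p), p) for p in patches]
    return [p for _, p in sorted(scored, key=lambda t: t[0], reverse=True)]

ranked = rank_candidates("null deref in parser", ["fix A", "fix B longer", "x"])
```

Because each verdict depends only on one candidate plus the prompt, pointwise regimes are exactly where the prompt-induced biases described above can silently reorder the ranking.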

Entities

Institutions

  • arXiv

Sources