ARTFEED — Contemporary Art Intelligence

Study Reveals Bias and Unreliability in LLM-as-a-Judge Systems for Software Engineering

ai-technology · 2026-04-22

A new research paper examines the use of Large Language Models (LLMs) as evaluators of code artifacts in software engineering workflows. The study, published on arXiv under identifier 2604.16790v1, investigates LLM-judge systems that rank candidate solutions and guide patch selection when human review or test coverage is insufficient. The researchers analyze two pointwise judging regimes across multiple tasks, including code generation, code repair, and test generation.

The paper systematically probes prompt-induced biases, finding that repeated evaluations of identical cases often produce conflicting results, that small modifications to prompts can dramatically alter outcomes, and that seemingly semantics-preserving perturbations elicit divergent verdicts. This measurement-first approach highlights the lack of a principled reliability framework in current practice, despite the attractive scalability of LLM-based evaluation. The study accounts for difficulty levels across repeated runs and employs controlled prompt interventions to isolate the effect of specific presentation cues. As LLMs become increasingly integrated into agentic software engineering workflows, understanding these biases is crucial for developing more reliable evaluation systems.
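
The run-to-run disagreement the paper measures can be illustrated with a minimal sketch. Everything here is hypothetical: `mock_judge` is a deterministic stand-in for an LLM judge call (the paper does not publish this code), and the stability metric is simply the fraction of repeated runs agreeing with the majority verdict.

```python
import random
from collections import Counter

def mock_judge(case_id: str, seed: int) -> str:
    """Hypothetical stand-in for one LLM judge call. Verdicts vary
    across seeds to mimic the nondeterminism the study measures."""
    rng = random.Random(f"{case_id}-{seed}")  # deterministic per (case, seed)
    return "pass" if rng.random() < 0.8 else "fail"

def verdict_stability(case_ids, n_runs=10):
    """Judge each case n_runs times and report the fraction of runs
    agreeing with the majority verdict (1.0 = perfectly stable)."""
    stability = {}
    for cid in case_ids:
        verdicts = [mock_judge(cid, seed) for seed in range(n_runs)]
        majority_count = Counter(verdicts).most_common(1)[0][1]
        stability[cid] = majority_count / n_runs
    return stability

scores = verdict_stability(["case-1", "case-2", "case-3"])
unstable = [cid for cid, s in scores.items() if s < 1.0]
```

With a real judge, any case scoring below 1.0 is one whose verdict flips across identical repeated evaluations, which is exactly the unreliability the study quantifies.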

Key facts

  • Large Language Models are increasingly used as judges to evaluate code artifacts
  • LLM judges help rank candidate solutions and guide patch selection in software engineering workflows
  • Current practice lacks a principled account of reliability and bias
  • Repeated evaluations of the same case can disagree
  • Small prompt edits can swing outcomes
  • Seemingly semantics-preserving perturbations may elicit divergent verdicts
  • The paper studies LLM-as-a-Judge for code through a measurement-first lens
  • It analyzes two pointwise judging regimes across code generation, code repair, and test generation
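
A pointwise regime, as opposed to a pairwise one, scores each candidate in isolation and ranks by the standalone scores. The sketch below is an assumed illustration of that setup, not the paper's implementation: the prompt template, `score_candidate` stub, and its length-based heuristic are all hypothetical placeholders for a real LLM call.

```python
# Hypothetical pointwise-judge prompt; a real system would send this to an LLM.
POINTWISE_PROMPT = (
    "You are reviewing a candidate patch for the issue below.\n"
    "Issue: {issue}\nPatch: {patch}\n"
    "Rate correctness from 1 (wrong) to 5 (correct). Reply with the number only."
)

def score_candidate(issue: str, patch: str) -> int:
    """Placeholder for a single LLM call returning a 1-5 score.
    A trivial length-based heuristic stands in for the model here."""
    return min(5, 1 + len(patch) % 5)

def rank_candidates(issue: str, patches: list[str]) -> list[str]:
    """Pointwise regime: each patch is judged independently (no pairwise
    comparison), then patches are sorted by their standalone scores."""
    scored = [(score_candidate(issue, p), p) for p in patches]
    return [p for _, p in sorted(scored, key=lambda t: t[0], reverse=True)]

ranked = rank_candidates("null deref in parser", ["fix A", "fix B longer", "x"])
```

Because each verdict depends only on one candidate plus the prompt, pointwise regimes are exactly where the prompt-induced biases described above can silently reorder the ranking.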

Entities

Institutions

  • arXiv

Sources