Local LLMs Achieve 43-45% Accuracy in Python Bug Detection
An empirical study evaluated how well locally run large language models (LLMs) detect Python bugs in real-world projects, using the BugsInPy benchmark. The work examined LLaMA 3.2 and Mistral on 349 bugs drawn from 17 projects, applying zero-shot prompting at the function level and scoring responses with an automated keyword-based evaluation framework. The local models reached 43-45% accuracy and often produced partially correct answers that located the problematic code region without proposing an exact fix. Performance varied markedly across projects. The study addresses a gap left by prior work that relied on cloud-based models or specialized hardware, both of which are impractical in privacy-sensitive or resource-constrained settings.
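The summary does not reproduce the exact prompt used in the study, but zero-shot, function-level prompting typically means handing the model a single function with a fixed instruction and no examples. A minimal sketch of such a prompt builder, with hypothetical wording, might look like:

```python
def build_zero_shot_prompt(function_source: str) -> str:
    """Build a zero-shot, function-level bug-detection prompt.

    The instruction wording here is a plausible stand-in, not the
    exact template used in the study.
    """
    return (
        "The following Python function contains a bug. "
        "Identify the buggy line(s) and briefly explain the defect.\n\n"
        "```python\n" + function_source + "\n```"
    )

# Hypothetical buggy function, for illustration only.
buggy = "def mean(xs):\n    return sum(xs) / (len(xs) - 1)\n"
prompt = build_zero_shot_prompt(buggy)
```

The prompt string would then be sent, unchanged, to a locally hosted model such as LLaMA 3.2 or Mistral; no fine-tuning or in-context examples are involved, which is what makes the setup zero-shot.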
Key facts
- LLaMA 3.2 and Mistral were evaluated for Python bug detection.
- 349 bugs across 17 projects from the BugsInPy benchmark were used.
- Zero-shot prompting at the function level was employed.
- Accuracy ranged from 43% to 45%.
- Many responses were partially correct, identifying problematic regions but not exact fixes.
- Performance varied significantly across projects.
- The study addresses limitations of cloud-based models in privacy-sensitive contexts.
- The evaluation framework was automated and keyword-based.
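The paper's keyword-based evaluation is not specified in detail here, but the idea of grading responses by overlap with ground-truth keywords, with a "partial" category for answers that locate the bug without nailing the fix, can be sketched as follows (the thresholds and labels are assumptions, not the study's exact rubric):

```python
def evaluate_response(response: str, keywords: list[str]) -> str:
    """Classify a model response by keyword overlap with the ground truth.

    Hypothetical rubric: all keywords present -> "correct",
    some present -> "partial", none -> "incorrect".
    """
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    if hits == len(keywords):
        return "correct"
    if hits > 0:
        return "partial"
    return "incorrect"
```

Under a rubric like this, a response that names the right function but misses the specific defect still scores "partial", which matches the finding that many answers highlighted problematic regions without offering the exact fix.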
Entities
Institutions
- arXiv