ARTFEED — Contemporary Art Intelligence

PromptAudit Framework Reveals Prompt Sensitivity in LLM Vulnerability Detection

ai-technology · 2026-05-26

A new study introduces PromptAudit, a controlled evaluation framework that isolates the effect of prompting strategy on large language models (LLMs) used for vulnerability detection. The framework holds the dataset, decoding settings, and output parsing fixed and varies only the prompting strategy. The researchers tested five prompting strategies across five open-weight models on 1,000 CVEs, comprising 6,074 code samples in 16 programming languages. The evaluation metrics include accuracy, recall, abstention, coverage, and effective F1. Standard chain-of-thought prompting achieves the strongest overall operational performance. Few-shot prompting offers model-dependent benefits, particularly for prompt-sensitive models. Adaptive chain-of-thought frequently suppresses recall, while self-consistency induces excessive abstention, which sharply reduces effective performance. The study underscores that the reliability of LLM-based vulnerability detection is highly sensitive to prompt formulation.
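The summary names the metrics without defining them. The following is a minimal sketch of how abstention-aware metrics of this kind might be computed, not the study's own implementation. The function name `audit_metrics`, the result keys, and the definitions used here are assumptions for illustration: in particular, "effective F1" is taken to mean an F1 score in which an abstention on a vulnerable sample counts as a missed detection, and accuracy is computed over all samples with abstentions counted as wrong.

```python
from typing import Dict, Optional, Sequence


def audit_metrics(
    labels: Sequence[bool],
    preds: Sequence[Optional[bool]],
) -> Dict[str, float]:
    """Compute abstention-aware detection metrics.

    labels: True if the sample is vulnerable, False if it is safe.
    preds:  True/False verdicts from the model, or None when it abstained.

    All definitions below are illustrative assumptions, not the study's.
    """
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    n = len(labels)
    pairs = list(zip(labels, preds))
    answered = [(y, p) for y, p in pairs if p is not None]

    # Coverage: share of samples on which the model gave a verdict.
    coverage = len(answered) / n

    tp = sum(1 for y, p in answered if y and p)
    fp = sum(1 for y, p in answered if not y and p)
    # Assumption: an abstention on a vulnerable sample is a missed detection.
    fn = sum(1 for y, p in pairs if y and not p)

    # Assumption: abstentions count as incorrect for accuracy.
    accuracy = sum(1 for y, p in answered if y == p) / n

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    effective_f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "recall": recall,
        "coverage": coverage,
        "abstention": 1.0 - coverage,
        "effective_f1": effective_f1,
    }
```

Under these assumed definitions, a strategy that abstains often is penalised twice: coverage drops, and every abstention on a vulnerable sample lowers recall and therefore effective F1. That is consistent with the reported pattern in which excessive abstention sharply reduces effective performance.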

Key facts

  • PromptAudit is a controlled evaluation framework for LLM vulnerability detection
  • Five prompting strategies tested across five open-weight models
  • Dataset includes 1,000 CVEs and 6,074 code samples in 16 programming languages
  • Standard chain-of-thought prompting achieves strongest overall performance
  • Few-shot prompting benefits are model-dependent
  • Adaptive chain-of-thought suppresses recall
  • Self-consistency induces excessive abstention
  • Study highlights prompt sensitivity in LLM-based vulnerability detection
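The controlled design described above (same dataset, same decoding, same parser, only the prompt changes) can be sketched as a small evaluation grid. This is an illustrative outline under stated assumptions, not PromptAudit's code: the template wording, the `DECODING` values, the `VULNERABLE`/`SAFE` answer format, and the names `parse_verdict` and `run_grid` are all hypothetical, and only two of the five strategies are shown. Each model is represented by a plain callable so that no particular inference library is assumed.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical prompt templates; the study's actual prompts are not given here.
STRATEGIES: Dict[str, str] = {
    "chain_of_thought": (
        "Think step by step, then answer on the last line with "
        "VULNERABLE or SAFE.\n\n{code}"
    ),
    "few_shot": (
        "Example: strcpy(buf, user_input); -> VULNERABLE\n\n"
        "Answer on the last line with VULNERABLE or SAFE.\n\n{code}"
    ),
}

# Decoding settings held constant for every run (values are assumptions).
DECODING: Dict[str, float] = {"temperature": 0.0, "max_tokens": 512}

# A model is any callable taking (prompt, decoding settings) and returning text.
Model = Callable[[str, Dict[str, float]], str]


def parse_verdict(text: str) -> Optional[bool]:
    """Shared parser: True = vulnerable, False = safe, None = abstain.

    Only the last non-empty line is inspected; anything other than an
    exact VULNERABLE or SAFE is treated as an abstention.
    """
    lines = [ln.strip().upper() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    return {"VULNERABLE": True, "SAFE": False}.get(lines[-1])


def run_grid(
    models: Dict[str, Model],
    samples: List[str],
) -> Dict[Tuple[str, str], List[Optional[bool]]]:
    """Run every (model, strategy) pair on the same samples.

    The dataset, decoding settings, and parser are identical in every
    cell, so differences between cells come from the prompt alone.
    """
    results: Dict[Tuple[str, str], List[Optional[bool]]] = {}
    for model_name, generate in models.items():
        for strategy, template in STRATEGIES.items():
            results[(model_name, strategy)] = [
                parse_verdict(generate(template.format(code=code), DECODING))
                for code in samples
            ]
    return results
```

Keeping the parser strict and shared means that a response the parser cannot read is recorded as an abstention for every strategy alike, rather than being interpreted differently per prompt.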

Entities

Institutions

  • arXiv
