Frontier LLMs Struggle in Cybersecurity Benchmarks
A recent preprint posted on arXiv examines whether advanced large language models are ready for cybersecurity work. The researchers built a dual-mode benchmark: white-box, function-level vulnerability detection in C, Java, and Python (VulnLLM-R), and black-box web application security testing on five production-like applications that together contain 118 ground-truth vulnerabilities across more than 20 CWE families. They evaluated six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, and Gemini 3 Flash) along with two domain-specialized models across four testing methods. All frontier models showed false positive rates between 10% and 50% in white-box detection and covered only 4-8% of the ground-truth vulnerabilities in black-box testing, rising to just 10-19% when given external tools. The study suggests that structured penetration-testing methodologies in specialized agents may improve results, but concludes that current frontier LLMs are not adequately prepared for practical cybersecurity tasks.
Key facts
- Dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R) and black-box web application security testing.
- White-box covers C, Java, and Python; black-box uses five production-like apps containing 118 vulnerabilities across 20+ CWE families.
- Six frontier models tested: GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, Gemini 3 Flash.
- Two domain-specialized models also tested.
- False positive rates of 10-50% in white-box detection for all frontier models.
- Black-box ground-truth coverage: 4-8% for frontier models, improving to 10-19% with external tools (Playwright MCP, Burp Suite MCP).
- Structured penetration-testing methodology in domain-specialized agents may improve results.
- Study concludes current frontier LLMs are not adequately prepared for practical cybersecurity tasks.
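For readers unfamiliar with the two headline metrics, the sketch below shows one common way to compute them. It is an illustration, not the study's evaluation code: the class and function names, the vulnerability IDs, and every count are hypothetical. It assumes the false positive rate is defined as FP / (FP + TN) and that coverage is the fraction of known vulnerabilities matched by at least one reported finding; the summary does not spell out either definition.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Confusion-matrix counts for a binary vulnerable/safe classifier."""
    true_positives: int
    false_positives: int
    true_negatives: int
    false_negatives: int


def false_positive_rate(counts: DetectionCounts) -> float:
    """FPR = FP / (FP + TN): share of safe functions flagged as vulnerable."""
    negatives = counts.false_positives + counts.true_negatives
    if negatives == 0:
        raise ValueError("no safe samples; false positive rate is undefined")
    return counts.false_positives / negatives


def ground_truth_coverage(reported: set[str], ground_truth: set[str]) -> float:
    """Share of known vulnerabilities that appear among the reported findings."""
    if not ground_truth:
        raise ValueError("ground truth is empty; coverage is undefined")
    return len(reported & ground_truth) / len(ground_truth)


# Hypothetical white-box run: 100 safe functions, 30 of them wrongly flagged.
white_box = DetectionCounts(
    true_positives=40, false_positives=30, true_negatives=70, false_negatives=10
)
fpr = false_positive_rate(white_box)

# Hypothetical black-box run against 118 known vulnerabilities (made-up IDs):
# the agent reports 9 real issues plus 2 findings that match nothing.
ground_truth = {f"VULN-{i}" for i in range(118)}
reported = {f"VULN-{i}" for i in range(9)} | {"SPURIOUS-1", "SPURIOUS-2"}
coverage = ground_truth_coverage(reported, ground_truth)

print(f"false positive rate: {fpr:.1%}")        # 30.0%
print(f"ground-truth coverage: {coverage:.1%}")  # 7.6%
```

If coverage is pooled over all 118 vulnerabilities (an assumption; the summary does not say how it is aggregated across the five applications), 4-8% corresponds to roughly 5 to 9 vulnerabilities found, and 10-19% to roughly 12 to 22.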
Entities
Institutions
- arXiv