LLM Agents for CTFs: Performance Revisited

ai-technology · 2026-05-23

A new study revisits claims that LLM agents achieve near-human success in Capture-the-Flag (CTF) cybersecurity challenges. The researchers built agent architectures of increasing complexity and evaluated them on 30 web-based CTFs covering 14 vulnerability classes, testing multiple LLM backbones and comparing the results against claude-code, a general-purpose agent. The claude-code agent solved 19 of the 30 tasks (about 63%), matching the specialized architectures, which suggests that general-purpose agents are strong baselines. Both kinds of agent struggled on the same challenge categories, indicating persistent barriers. The work offers a second look at earlier, more optimistic results.

Key facts

Study revisits claims of near-human LLM agent performance in CTFs.
Tested on 30 web-based CTF challenges across 14 vulnerability classes.
Compared engineered architectures with the general-purpose claude-code agent.
The claude-code agent solved 19/30 tasks, comparable to specialized designs.
Both agent types struggled in the same challenge categories.
General-purpose agents are strong baselines for offensive security.
Persistent barriers remain for LLM agents in CTF tasks.
The research offers a second look at previously reported success rates.
