LLM Agents for CTFs: Performance Revisited
A new study revisits claims that LLM agents achieve near-human success rates on Capture-the-Flag (CTF) cybersecurity challenges. The researchers engineered agent architectures of increasing complexity and evaluated them on 30 web-based CTFs covering 14 vulnerability classes, testing multiple LLM backbones and comparing the results against claude-code, a general-purpose agent. Key findings: claude-code solved 19 of the 30 tasks, matching the specialized architectures, which suggests that general-purpose agents are strong baselines. Both the specialized and the general-purpose agents struggled on the same challenge categories, indicating persistent barriers. The work offers a second look at earlier, more optimistic results.
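For readers who want the headline figure as a percentage, the arithmetic behind "19 of 30" can be sketched as follows. This is only an illustrative calculation; the helper name `solve_rate` is an assumption made for this example and does not come from the study.

```python
def solve_rate(solved: int, total: int) -> float:
    """Return the fraction of challenges solved (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    return solved / total


# Figures reported in the summary: claude-code solved 19 of 30 tasks.
rate = solve_rate(19, 30)
print(f"claude-code solve rate: {rate:.1%}")  # prints "claude-code solve rate: 63.3%"
```

In other words, roughly 63% of the benchmark was solved, leaving 11 of the 30 challenges unsolved.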
Key facts
- Study revisits claims of near-human LLM agent performance in CTFs.
- Tested on 30 web-based CTF challenges across 14 vulnerability classes.
- Compared engineered architectures with claude-code, a general-purpose agent.
- claude-code solved 19 of 30 tasks, comparable to the specialized designs.
- Both agent types struggled in the same challenge categories.
- General-purpose agents are strong baselines for offensive security.
- Persistent barriers remain for LLM agents in CTF tasks.
- The research provides a second look at previously reported success rates.
Entities
Institutions
- arXiv