ARTFEED — Contemporary Art Intelligence

LLM Agents for CTFs: Performance Revisited

ai-technology · 2026-05-23

A new study revisits claims that LLM agents achieve near-human success rates on Capture-the-Flag (CTF) cybersecurity challenges. The researchers built agent architectures of increasing complexity, ran them with multiple LLM backbones on 30 web-based CTFs covering 14 vulnerability classes, and compared them against claude-code, a general-purpose agent. Key findings: claude-code solved 19 of 30 tasks (about 63%), matching the specialized architectures, which suggests that general-purpose agents are strong baselines for offensive security. Both kinds of agent struggled on the same challenge categories, indicating persistent barriers. The work offers a second look at earlier optimistic results.

Key facts

  • Study revisits claims of near-human LLM agent performance in CTFs.
  • Tested on 30 web-based CTF challenges across 14 vulnerability classes.
  • Compared engineered architectures with claude-code, a general-purpose agent.
  • claude-code solved 19/30 tasks (about 63%), comparable to specialized designs.
  • Both agent types struggled in the same challenge categories.
  • General-purpose agents are strong baselines for offensive security.
  • Persistent barriers remain for LLM agents in CTF tasks.
  • The research provides a second look at previously reported success rates.

Entities

Institutions

  • arXiv
