Study Reveals When AI Agent Skills Fail in Cybersecurity

other · 2026-05-20

A recent study on arXiv (2605.20023) questions the belief that adding procedural knowledge, known as Skills, to LLM agents always improves their performance. Skills raise task pass rates by 16.2 percentage points on average across domains, yet 16 of 84 tasks show negative deltas when Skills are introduced. The researchers re-analyzed a 180-run controlled experiment with an MCP-grounded autonomous Capture-the-Flag agent under four documentation conditions of 55, 1,478, 1,976, and 4,147 lines, corresponding to the No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills variants respectively. In offensive cybersecurity, a domain that existing Skills benchmarks do not cover in depth, the marginal benefit of Skills collapses. The community still lacks a clear framework for telling when Skills help and when they only add complexity.

Key facts

Skills improve task pass rates by 16.2 percentage points on average
16 of 84 tasks show negative deltas when Skills are introduced
Study re-analyzed a 180-run controlled experiment
Agent used MCP-grounded autonomous Capture-the-Flag setup
Four documentation conditions: 55, 1,478, 1,976, and 4,147 lines
Conditions correspond to No-Skills, Experiential-Skills, Curated-Skills, Comprehensive-Skills
Offensive cybersecurity is not deeply covered by existing Skills benchmarks
Marginal benefit of Skills collapses in offensive cybersecurity
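
To put the figures above in proportion, the following is a minimal Python sketch that uses only the numbers quoted in this summary. The function names (negative_share, size_ratios) are hypothetical helpers introduced for illustration and do not come from the study; the sketch does not reproduce any of the paper's results beyond this arithmetic.

```python
# Figures quoted in the summary above; nothing here comes from the paper's data.
TASKS_TOTAL = 84      # tasks in the cross-domain comparison
TASKS_NEGATIVE = 16   # tasks whose pass rate dropped when Skills were added

# Documentation size in lines for each condition, as listed in the key facts.
DOC_LINES = {
    "No-Skills": 55,
    "Experiential-Skills": 1478,
    "Curated-Skills": 1976,
    "Comprehensive-Skills": 4147,
}


def negative_share(negative: int, total: int) -> float:
    """Percentage of tasks that got worse with Skills (hypothetical helper)."""
    return 100.0 * negative / total


def size_ratios(doc_lines: dict, baseline: str = "No-Skills") -> dict:
    """Documentation size of each condition relative to the baseline condition."""
    base = doc_lines[baseline]
    return {name: lines / base for name, lines in doc_lines.items()}


if __name__ == "__main__":
    share = negative_share(TASKS_NEGATIVE, TASKS_TOTAL)
    print(f"Tasks with negative deltas: {share:.1f}%")
    for name, ratio in size_ratios(DOC_LINES).items():
        print(f"{name}: {ratio:.1f}x the No-Skills documentation")
```

Read this way, roughly one task in five regressed, and the largest documentation condition is about 75 times the size of the smallest, which gives a sense of how much extra context the agent has to process in the Comprehensive-Skills setting.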
