New Research Tool Evolve-CTF Evaluates AI Agent Robustness in Cybersecurity Tasks
A new research paper introduces CTF challenge families as a method for evaluating agentic large language models on cybersecurity tasks. The approach applies semantics-preserving program transformations to capture-the-flag challenges, producing multiple variants of each challenge while keeping the underlying exploit strategy identical. The researchers built Evolve-CTF, a tool that generates these challenge families from Python-based CTF problems.

The study evaluated 13 agentic LLM configurations with tool access on families derived from Cybench and Intercode challenges. Models proved largely robust to simple transformations such as identifier renaming and code insertion, but their performance degraded substantially under composed transformations and deeper obfuscation.

The work addresses a limitation of existing pointwise benchmarks, which offer little insight into agent robustness and generalization: because every variant in a family is semantically equivalent, challenge families allow controlled evaluation of how agents handle alternative versions of the same source code. The paper is available on arXiv as 2602.05523v2 (announcement type: replace-cross).
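The paper's actual transformation catalog isn't reproduced here, but the core idea is easy to sketch. Below is a minimal, hypothetical illustration of one semantics-preserving transformation, identifier renaming, using Python's standard `ast` module; the `RenameTransformer` class, the name mapping, and the toy `check_flag` challenge are illustrative assumptions, not code from Evolve-CTF.

```python
import ast

class RenameTransformer(ast.NodeTransformer):
    """Rename functions, variables, and parameters without changing behavior."""

    def __init__(self, mapping):
        self.mapping = mapping  # original identifier -> replacement identifier

    def visit_Name(self, node):
        # Covers loads and stores of plain variables.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        # Covers function parameters.
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node

    def visit_FunctionDef(self, node):
        if node.name in self.mapping:
            node.name = self.mapping[node.name]
        self.generic_visit(node)
        return node

# A toy stand-in for a Python CTF challenge (hypothetical, for illustration).
source = '''
def check_flag(guess):
    secret = "CTF{example}"
    return guess == secret
'''

tree = RenameTransformer({"check_flag": "f0", "guess": "a0", "secret": "v0"}).visit(ast.parse(source))
print(ast.unparse(tree))  # same logic, opaque names: def f0(a0): v0 = ...
```

Because only names change, any exploit that works on the original also works on the variant, which is what lets the benchmark attribute a score drop to the transformation itself rather than to a different vulnerability.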
Key facts
- Researchers introduced CTF challenge families for evaluating agentic LLMs in cybersecurity
- Semantics-preserving program transformations create multiple challenge versions with identical exploit strategies
- Evolve-CTF tool generates CTF families from Python challenges using various transformations
- Study evaluated 13 agentic LLM configurations with tool access
- Models showed robustness to renaming and code insertion transformations
- Performance degraded with composed transformations and deeper obfuscation (a sketch of pass composition follows this list)
- Research used families derived from Cybench and Intercode challenges
- Paper available on arXiv as 2602.05523v2 with announcement type replace-cross
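To illustrate what "composed transformations" might look like in practice, here is a hedged sketch that chains two independent semantics-preserving passes, dead-code insertion and string-constant obfuscation, over a single AST. The pass classes, the `compose` helper, and the toy challenge are hypothetical stand-ins; the paper's own implementation is not published here.

```python
import ast

class DeadCodeInserter(ast.NodeTransformer):
    """Prepend a harmless, never-read statement to each function body."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        dead = ast.parse("_unused = len('padding') * 0").body[0]
        node.body.insert(0, dead)  # evaluates to 0, then is ignored
        return node

class ConstantWrapper(ast.NodeTransformer):
    """Replace string literals with an equivalent runtime expression."""

    def visit_Constant(self, node):
        if isinstance(node.value, str):
            # "abc" -> ''.join(['a', 'b', 'c']): same value, harder to grep.
            chars = ", ".join(repr(c) for c in node.value)
            expr = ast.parse(f"''.join([{chars}])", mode="eval").body
            return ast.copy_location(expr, node)
        return node

def compose(tree, *passes):
    """Run each transformation pass over the result of the previous one."""
    for p in passes:
        tree = p.visit(tree)
    return ast.fix_missing_locations(tree)

# A toy stand-in for a Python CTF challenge (hypothetical, for illustration).
source = '''
def check_flag(guess):
    return guess == "CTF{example}"
'''

variant = compose(ast.parse(source), ConstantWrapper(), DeadCodeInserter())
print(ast.unparse(variant))
```

Each pass on its own resembles the "simple" transformations the evaluated models handled well; chaining passes like this is the kind of composition under which the study reports substantial performance drops.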