New Cyber Defense Benchmark Evaluates LLM Agents for Threat Hunting in Security Operations
Researchers have unveiled the Cyber Defense Benchmark, a framework for assessing how well large language model agents perform threat hunting in security operations centers. The benchmark focuses on detecting malicious activity in raw Windows event logs, drawing on 106 real attack procedures from the OTRF Security-Datasets corpus that span 86 MITRE ATT&CK sub-techniques across 12 tactics. Agents work against an SQLite database of 75,000 to 135,000 log records produced by a campaign simulator, submitting SQL queries to pinpoint the timestamps of malicious events; answers are scored CTF-style against Sigma-rule-derived ground truth. The study evaluates five frontier models, including Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, and Kimi K2.5. The findings are available on arXiv (identifier 2604.19533v1).
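The query-over-logs workflow described above can be sketched as follows. The table name, column layout, and log values here are hypothetical illustrations, not taken from the benchmark's actual schema:

```python
import sqlite3

# Hypothetical schema: the benchmark's real table layout is not specified
# in this summary; actual episodes contain 75,000-135,000 Windows
# event-log records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, event_id INTEGER, "
    "process TEXT, command_line TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01T10:00:00", 4688, "explorer.exe", "explorer.exe"),
        ("2024-01-01T10:05:12", 4688, "powershell.exe",
         "powershell.exe -enc SQBFAFgA..."),  # encoded-command artifact
    ],
)

# An agent hunting for encoded PowerShell might submit a query like:
rows = conn.execute(
    "SELECT ts FROM events "
    "WHERE process = 'powershell.exe' AND command_line LIKE '%-enc%'"
).fetchall()
# The returned timestamps would then be submitted as the agent's answer.
```

The key design point is that the agent never sees the logs directly; it only observes the results of the SQL queries it chooses to run.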
Key facts
- The Cyber Defense Benchmark measures LLM agent performance in threat hunting tasks
- It uses 106 real attack procedures from the OTRF Security-Datasets corpus
- Covers 86 MITRE ATT&CK sub-techniques across 12 tactics
- Each episode provides 75,000-135,000 log records in an SQLite database
- Agents must submit SQL queries to find malicious event timestamps
- Scoring is CTF-style against Sigma-rule-derived ground truth
- Implemented as Gymnasium reinforcement-learning environment
- Five frontier models evaluated, including Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, and Kimi K2.5
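The episode loop these facts imply can be sketched as a minimal environment following the Gymnasium reset/step convention. This is a plain-Python stand-in with an assumed scoring rule; the class, method bodies, and reward scheme are illustrative, not the benchmark's actual implementation:

```python
class ThreatHuntEnv:
    """Minimal stand-in following the Gymnasium reset/step convention.

    The agent's final action is a set of timestamps; the reward is
    CTF-style: 1.0 when the submission matches the Sigma-rule-derived
    ground truth. All names and the scoring rule here are assumptions.
    """

    def __init__(self, ground_truth_timestamps):
        # In the real benchmark, ground truth comes from Sigma rules
        # applied to the simulated campaign's logs.
        self.ground_truth = set(ground_truth_timestamps)

    def reset(self):
        # A real episode would load 75,000-135,000 log records into an
        # SQLite database the agent can query.
        observation = {"task": "find timestamps of malicious events"}
        info = {}
        return observation, info

    def step(self, submitted_timestamps):
        # CTF-style scoring: exact match against ground truth.
        correct = set(submitted_timestamps) == self.ground_truth
        reward = 1.0 if correct else 0.0
        terminated = True
        return {}, reward, terminated, False, {}


# Usage: one episode with a single malicious event.
env = ThreatHuntEnv(["2024-01-01T10:05:12"])
obs, info = env.reset()
_, reward, terminated, _, _ = env.step(["2024-01-01T10:05:12"])
```

Framing the benchmark as a Gymnasium environment means the same episodes can serve both for evaluation and, in principle, for reinforcement-learning training loops.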
Entities
Institutions
- arXiv
- OTRF Security-Datasets
- MITRE ATT&CK
- Gymnasium