New Cyber Defense Benchmark Evaluates LLM Agents for Threat Hunting in Security Operations
Researchers have unveiled the Cyber Defense Benchmark, a framework for assessing how well large language model agents perform threat hunting in security operations centers. The benchmark focuses on detecting malicious activity in raw Windows event logs, drawing on 106 real attack procedures from the OTRF Security-Datasets corpus that span 86 MITRE ATT&CK sub-techniques across 12 tactics. Agents work against an SQLite database of 75,000 to 135,000 log records produced by a campaign simulator, submitting SQL queries to pinpoint the timestamps of malicious events; answers are scored CTF-style against Sigma-rule-derived ground truth. The study evaluates five frontier models, including Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, and Kimi K2.5. The findings are available on arXiv (identifier 2604.19533v1).
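The query-over-logs workflow described above can be sketched as follows. The table name, column layout, and log values here are hypothetical illustrations, not taken from the benchmark's actual schema:

```python
import sqlite3

# Hypothetical schema: the benchmark's real table layout is not specified
# in this summary; actual episodes contain 75,000-135,000 Windows
# event-log records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, event_id INTEGER, "
    "process TEXT, command_line TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01T10:00:00", 4688, "explorer.exe", "explorer.exe"),
        ("2024-01-01T10:05:12", 4688, "powershell.exe",
         "powershell.exe -enc SQBFAFgA..."),  # encoded-command artifact
    ],
)

# An agent hunting for encoded PowerShell might submit a query like:
rows = conn.execute(
    "SELECT ts FROM events "
    "WHERE process = 'powershell.exe' AND command_line LIKE '%-enc%'"
).fetchall()
# The returned timestamps would then be submitted as the agent's answer.
```

The key design point is that the agent never sees the logs directly; it only observes the results of the SQL queries it chooses to run.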
Key facts
- The Cyber Defense Benchmark measures LLM agent performance in threat hunting tasks
- It uses 106 real attack procedures from the OTRF Security-Datasets corpus
- Covers 86 MITRE ATT&CK sub-techniques across 12 tactics
- Each episode provides 75,000-135,000 log records in an SQLite database
- Agents must submit SQL queries to find malicious event timestamps
- Scoring is CTF-style against Sigma-rule-derived ground truth
- Implemented as Gymnasium reinforcement-learning environment
- Five frontier models evaluated, including Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, and Kimi K2.5
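The episode loop these facts imply can be sketched as a minimal environment following the Gymnasium reset/step convention. This is a plain-Python stand-in with an assumed scoring rule; the class, method bodies, and reward scheme are illustrative, not the benchmark's actual implementation:

```python
class ThreatHuntEnv:
    """Minimal stand-in following the Gymnasium reset/step convention.

    The agent's final action is a set of timestamps; the reward is
    CTF-style: 1.0 when the submission matches the Sigma-rule-derived
    ground truth. All names and the scoring rule here are assumptions.
    """

    def __init__(self, ground_truth_timestamps):
        # In the real benchmark, ground truth comes from Sigma rules
        # applied to the simulated campaign's logs.
        self.ground_truth = set(ground_truth_timestamps)

    def reset(self):
        # A real episode would load 75,000-135,000 log records into an
        # SQLite database the agent can query.
        observation = {"task": "find timestamps of malicious events"}
        info = {}
        return observation, info

    def step(self, submitted_timestamps):
        # CTF-style scoring: exact match against ground truth.
        correct = set(submitted_timestamps) == self.ground_truth
        reward = 1.0 if correct else 0.0
        terminated = True
        return {}, reward, terminated, False, {}


# Usage: one episode with a single malicious event.
env = ThreatHuntEnv(["2024-01-01T10:05:12"])
obs, info = env.reset()
_, reward, terminated, _, _ = env.step(["2024-01-01T10:05:12"])
```

Framing the benchmark as a Gymnasium environment means the same episodes can serve both for evaluation and, in principle, for reinforcement-learning training loops.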
Entities
Institutions
- arXiv
- OTRF Security-Datasets
- MITRE ATT&CK
- Gymnasium