Frontier AI Models Score Below 50% on ITBench-AA SRE Benchmark
ITBench-AA SRE, a new benchmark for agentic enterprise IT tasks developed by Artificial Analysis and IBM, shows every frontier AI model scoring below 50%. The benchmark evaluates models on Kubernetes incident diagnosis: each of its 59 tasks requires the model to identify the root-cause entities from an incident snapshot. Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. Open-weight models trail the leader by 7 to 10 points, with GLM-5.1 at 40% and Gemma 4 31B at 37% for $0.14 per task. Longer trajectories do not correlate with higher accuracy: Gemini 3.1 Pro Preview averages 83 turns yet scores 30%, and models that over-investigate tend to produce false positives. The harness (Stirrup) is held constant across models so that scores are comparable. The results show that even advanced models struggle with complex IT operations and that cost per task varies widely between models.
Key facts
- Claude Opus 4.7 leads at 47% accuracy.
- GPT-5.5 scores 46% and Qwen3.7 Max scores 42%.
- All frontier models score below 50% on ITBench-AA SRE.
- GLM-5.1 leads open-weight models at 40%.
- Gemma 4 31B scores 37% at $0.14 per task.
- Gemini 3.1 Pro Preview averages 83 turns but scores 30%.
- ITBench-AA includes 59 SRE tasks (40 public, 19 held-out).
- Scoring uses average precision at full recall.
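The scoring rule above can be illustrated with a short sketch. It assumes one plausible reading of "average precision at full recall": a task earns credit only if the model names every true root-cause entity, the credit equals the precision of its answer, and the benchmark score is the mean over tasks. The source does not spell out the exact formula, so this interpretation, the function names, and the entity names are all hypothetical.

```python
from typing import Iterable, Set, Tuple


def task_score(predicted: Iterable[str], truth: Set[str]) -> float:
    """Precision of the answer if it covers every true entity, else 0.0 (assumed rule)."""
    if not truth:
        raise ValueError("each task needs at least one root-cause entity")
    predicted_set = set(predicted)  # duplicates are counted once
    if not truth <= predicted_set:  # full recall not reached: no credit
        return 0.0
    return len(truth) / len(predicted_set)


def benchmark_score(results: Iterable[Tuple[Iterable[str], Set[str]]]) -> float:
    """Mean of the per-task scores over all (predicted, truth) pairs."""
    scores = [task_score(predicted, truth) for predicted, truth in results]
    if not scores:
        raise ValueError("no tasks to score")
    return sum(scores) / len(scores)


# Hypothetical results for three tasks; entity names are made up for illustration.
example = [
    (["deployment/checkout"], {"deployment/checkout"}),        # exact answer
    (["service/cart", "pod/a", "pod/b"], {"service/cart"}),    # two false positives
    (["pod/c"], {"deployment/payments"}),                      # root cause missed
]

if __name__ == "__main__":
    print(f"benchmark score: {benchmark_score(example):.3f}")
```

Under this reading, the second task shows why over-investigation is costly: the model found the root cause, but two extra entities cut its credit for that task to one third.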
Entities
Institutions
- Artificial Analysis
- IBM
- HuggingFace