Measuring Safety Alignment in Autonomous Security Agents
A recent arXiv paper introduces a trace-based benchmark for assessing how safety alignment affects language models deployed as autonomous security agents. The benchmark comprises 30 local vulnerability-analysis tasks evaluated with fixed tools and deterministic success predicates, alongside redaction and grounding assessments. The study compares four stock, safety-aligned models (Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B) against their uncensored or abliterated derivatives. The uncensored Gemma models outperformed their safety-aligned counterparts on task success, which suggests that single-turn refusal benchmarks do not capture how agents actually perform.
Key facts
- arXiv paper 2605.19722 introduces a trace-based benchmark for safety alignment in autonomous security agents.
- Benchmark includes 30 local vulnerability-analysis tasks with fixed tools and deterministic success predicates.
- Compared four stock models against uncensored/abliterated derivatives: Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, Llama 3.1 8B.
- Artifact contains 1,500 security-agent traces and 800 non-security control traces.
- Gemma 4 31B uncensored achieved 14.0% success vs 0.7% for the safety-aligned version.
- Gemma 4 26B uncensored achieved 10.7% success vs 0.0% for the safety-aligned version.
- Uncensored models had higher mean grounding scores than their safety-aligned counterparts (3.91 vs 3.27 and 4.12 vs 1.64, on a 5-point scale).
- The paper concludes that single-turn refusal benchmarks are insufficient for evaluating autonomous security agents.
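
To make the idea of a deterministic success predicate over agent traces more concrete, the following is a minimal Python sketch. It is not the paper's implementation: the `Trace` and `ToolCall` structures, the `port_scan` tool name, the `EXAMPLE-VULN-1` identifier, and the pass criterion (the finding must appear both in the final report and in a tool output) are all hypothetical, chosen only to illustrate how a fixed rule can score a trace the same way on every run and how a success rate is then aggregated.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToolCall:
    tool: str    # name of the tool the agent invoked (hypothetical)
    output: str  # raw text the tool returned


@dataclass
class Trace:
    task_id: str
    model: str
    calls: List[ToolCall] = field(default_factory=list)
    final_report: str = ""


def reported_finding_is_grounded(trace: Trace, finding: str = "EXAMPLE-VULN-1") -> bool:
    """Hypothetical predicate: pass only if the finding is named in the
    final report AND also appears in at least one tool output."""
    in_report = finding in trace.final_report
    in_evidence = any(finding in call.output for call in trace.calls)
    return in_report and in_evidence


def success_rate(traces: List[Trace], predicate: Callable[[Trace], bool]) -> float:
    """Percentage of traces for which the predicate holds."""
    if not traces:
        return 0.0
    passed = sum(1 for t in traces if predicate(t))
    return 100.0 * passed / len(traces)


# Three made-up traces: a grounded success, a refusal, and an unsupported claim.
traces = [
    Trace("task-01", "model-a",
          [ToolCall("port_scan", "22/tcp open ssh EXAMPLE-VULN-1")],
          "Found EXAMPLE-VULN-1 on port 22."),
    Trace("task-01", "model-b", [], "I cannot help with that."),
    Trace("task-01", "model-c",
          [ToolCall("port_scan", "22/tcp open ssh")],
          "Found EXAMPLE-VULN-1 on port 22."),
]

rate = success_rate(traces, reported_finding_is_grounded)
print(f"{rate:.1f}% of traces passed")
```

Because the predicate inspects the whole trace rather than a single response, a model that refuses outright and a model that reports a finding with no supporting tool output both fail. A single-turn refusal check would not distinguish an unsupported report from a successful one.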
Entities
Institutions
- arXiv