New AI Safety Benchmark Targets Covert Decomposition Attacks on Language Models
Researchers have introduced Benchmarks for Stateful Defenses (BSD), a data generation pipeline for evaluating defenses against covert adversarial attacks on language models. The work addresses a critical gap in AI safety evaluations, which traditionally focus on overt attacks and low-stakes scenarios. In practice, adversaries can bypass existing safeguards by submitting many small, seemingly benign queries across independent interactions: each query appears harmless on its own, but the combined fragments provide enough uplift to complete dangerous tasks. This strategy, known as a decomposition attack, proves highly effective at enabling misuse.

Using the BSD pipeline, the team curated two new datasets that consistently trigger refusals from frontier models while remaining too difficult for weaker open-weight models to complete. The research, documented in arXiv preprint 2506.06414v2, identifies stateful defenses as a promising countermeasure: by maintaining context across multiple queries, these defenses can detect coordinated misuse patterns that would otherwise go unnoticed. The findings underscore the vulnerability of current safety mechanisms to fragmented attack strategies that exploit the incremental nature of assistance requests.
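To make the idea concrete, here is a minimal sketch of how a stateful defense might work. All names, the keyword-based scorer, and the threshold are illustrative assumptions, not details from the BSD paper; a real deployment would use a trained classifier over the full conversation state rather than keyword weights.

```python
from collections import defaultdict

# Hypothetical keyword-based risk scorer (illustrative only).
RISKY_TERMS = {"synthesis": 0.4, "precursor": 0.4, "bypass": 0.3}

def risk_score(query: str) -> float:
    """Assign a small risk weight to each query; benign queries score near 0."""
    q = query.lower()
    return sum(weight for term, weight in RISKY_TERMS.items() if term in q)

class StatefulDefense:
    """Tracks per-session query history and refuses once cumulative risk
    suggests a coordinated decomposition attack, even when every
    individual query would pass a stateless check."""

    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold
        self.history = defaultdict(list)  # session_id -> [(query, score), ...]

    def check(self, session_id: str, query: str) -> bool:
        """Return True if the query should be refused."""
        score = risk_score(query)
        self.history[session_id].append((query, score))
        total = sum(s for _, s in self.history[session_id])
        # Each fragment may look harmless alone; the refusal decision
        # depends on the accumulated state of the whole session.
        return total >= self.threshold

defense = StatefulDefense(threshold=1.0)
# First fragment scores 0 on its own and is allowed through.
print(defense.check("s1", "What solvents dissolve common plastics?"))
# A later fragment pushes the session's cumulative risk past the threshold.
print(defense.check("s1", "How do I bypass a filter in a precursor synthesis step?"))
```

A stateless filter would evaluate the second query in isolation; the point of the stateful design is that the decision depends on the session's full history, which is what lets it catch attacks split across many benign-looking fragments.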
Key facts
- Researchers developed Benchmarks for Stateful Defenses (BSD) to evaluate AI safety against covert attacks
- Current language model safety evaluations focus on overt attacks and low-stakes tasks
- Adversaries can subvert safeguards using many small, benign-seeming queries across independent interactions
- Individual queries appear harmless, but the combined fragments provide uplift toward hard, dangerous tasks
- BSD is a data generation pipeline automating evaluations of covert attacks and corresponding defenses
- Two new datasets were curated that are consistently refused by frontier models
- The datasets are too difficult for weaker open-weight models to complete
- Decomposition attacks were found to be effective misuse enablers, with stateful defenses identified as promising counters