FragBench: New Benchmark Exposes Cross-Session LLM Attacks
Researchers have unveiled FragBench, a benchmark for detecting malicious prompts whose intent is fragmented across separate LLM sessions. Whereas existing safety benchmarks evaluate individual prompts or turns within a single conversation, FragBench targets attacks whose signal is dispersed across sessions that share no context. The benchmark is derived from 24 real-world cyber-incident campaigns and captures the full attack pipeline: multi-fragment kill chains, per-fragment safety-judge verdicts, sandboxed execution traces, and matched benign cover sessions. It defines two tasks: FragBench Attack, an adversarial rewriter that hardens fragments against a single-turn safety judge, and FragBench Defense, a graph-based user-level detector trained on the resulting interactions. By construction, the single-turn judge performs near chance on the released corpus; four GNN variants and three classifiers are evaluated on the defense task. The work highlights a blind spot in current LLM safety evaluation and offers a tool for studying cross-session threats.
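The core defense idea, aggregating weak per-fragment signals at the user level rather than judging each prompt in isolation, can be sketched minimally. This is hypothetical illustration code, not the paper's implementation: the `Fragment` type, the score threshold, and the mean-based aggregation are all assumptions standing in for FragBench Defense's graph-based detector.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    user: str         # account the session is attributed to
    text: str         # the prompt fragment
    judge_score: float  # single-turn safety-judge score in [0, 1]

def single_turn_flags(fragments, threshold=0.5):
    """Flag each fragment independently, as a per-prompt judge would."""
    return [f.judge_score >= threshold for f in fragments]

def user_level_scores(fragments):
    """Aggregate per-fragment evidence across all sessions of a user.

    A stand-in for a graph-based user-level detector: here we simply
    average judge scores per user, so a kill chain of many mildly
    suspicious fragments surfaces even when no single fragment trips
    the per-prompt threshold.
    """
    totals, counts = {}, {}
    for f in fragments:
        totals[f.user] = totals.get(f.user, 0.0) + f.judge_score
        counts[f.user] = counts.get(f.user, 0) + 1
    return {u: totals[u] / counts[u] for u in totals}

if __name__ == "__main__":
    # Five weak attack fragments from one user, one benign session from another.
    frags = [Fragment("mallory", f"step {i}", 0.3) for i in range(5)]
    frags.append(Fragment("alice", "help me write a poem", 0.05))
    print(single_turn_flags(frags))   # every fragment individually passes
    print(user_level_scores(frags))   # but mallory's aggregate stands out
```

With a user-level threshold of, say, 0.2, the distributed attacker is flagged while the single-turn judge misses every fragment, mirroring the near-chance judge performance the corpus is built to induce.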
Key facts
- FragBench is a benchmark for cross-session LLM attacks.
- It uses 24 real-world cyber-incident campaigns.
- Attackers split malicious goals into sub-prompts across sessions.
- Existing benchmarks evaluate single prompts or turns within one chat.
- FragBench includes multi-fragment kill chains and safety-judge verdicts.
- It has two tasks: FragBench Attack and FragBench Defense.
- The single-turn judge is near chance on the corpus by construction.
- Four GNN variants and three classifiers are evaluated for defense.
Entities
Venues
- arXiv