New Detector CPD Identifies Fluent Adversarial Prompts in LLMs via Entropy Changes
Researchers have introduced CPD Online, a detector for fluent, optimization-based adversarial suffixes in large language models (LLMs). Rather than relying on static or windowed perplexity, CPD treats detection as an online change-point problem over the stream of token-level next-token entropies. It estimates a robust baseline from the LLM system prompt, standardizes the entropies of user tokens against that baseline, and applies a one-sided CUSUM statistic. CPD is model-agnostic, requires no training, and runs online. On a benchmark of 1,012 optimization-based suffix attacks and 1,012 benign prompts, CPD improved F1 over windowed-perplexity baselines on all six tested models and reached an AUROC of 0.88 on LLaMA-2-7B at k=0.
Key facts
- CPD Online detects fluent optimization-based adversarial suffixes in LLMs.
- Detection is cast as an online change-point problem over token-level next-token entropy.
- The detector uses the LLM system prompt to estimate a robust baseline.
- It applies a one-sided CUSUM statistic on standardized user-token entropies.
- CPD is model-agnostic, training-free, and runs online.
- Benchmark includes 1,012 attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 benign prompts.
- CPD improves F1 over windowed-perplexity baselines on all six tested models.
- On LLaMA-2-7B at k=0, CPD achieves AUROC 0.88.
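The pipeline described above (system-prompt baseline, standardized user-token entropies, one-sided CUSUM) can be illustrated with a minimal sketch. It is not the authors' implementation. The median/MAD estimator, the upward direction of the monitored shift, the `slack` and `threshold` values, and all function names are assumptions made for illustration; the summary states only that the baseline is "robust" and that the CUSUM is one-sided.

```python
import math
from statistics import median


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def robust_baseline(system_entropies):
    """Estimate center and scale from system-prompt token entropies.

    Assumption: median and MAD stand in for the 'robust baseline';
    the source does not say which estimator is used.
    """
    center = median(system_entropies)
    mad = median([abs(h - center) for h in system_entropies])
    scale = 1.4826 * mad  # rescale MAD to match a std-dev under normality
    return center, max(scale, 1e-6)  # guard against a zero scale


def cusum_alarm(user_entropies, center, scale, slack=0.5, threshold=4.0):
    """One-sided (upward) CUSUM over standardized user-token entropies.

    Returns the index of the first token at which the statistic exceeds
    `threshold`, or None if no alarm is raised. `slack` and `threshold`
    are hypothetical tuning values, not taken from the source.
    """
    s = 0.0
    for i, h in enumerate(user_entropies):
        z = (h - center) / scale      # standardize against the baseline
        s = max(0.0, s + z - slack)   # accumulate only upward drift
        if s > threshold:
            return i
    return None


# Toy entropy streams (made-up numbers, for illustration only).
system = [1.0, 1.1, 0.9, 1.0, 1.05]
center, scale = robust_baseline(system)
benign = [1.0, 0.95, 1.05, 1.0]
shifted = [1.0, 1.0, 1.5, 1.6, 1.7]
benign_alarm = cusum_alarm(benign, center, scale)
shifted_alarm = cusum_alarm(shifted, center, scale)
```

With these toy values, `benign_alarm` is `None` and `shifted_alarm` is `2`, the index of the first token whose entropy departs from the baseline. Because the statistic is updated one token at a time, the check can run while the prompt is being read, which is what makes the approach online.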