New Detector CPD Identifies Fluent Adversarial Prompts in LLMs via Entropy Changes

ai-technology · 2026-05-20

Researchers have introduced CPD Online, a detector for fluent, optimization-based adversarial suffixes in large language model (LLM) prompts. Rather than relying on static or windowed perplexity, as earlier methods do, CPD casts detection as an online change-point problem over the stream of token-level next-token entropies. It estimates a robust baseline from the LLM system prompt, standardizes the user-token entropies against that baseline, and applies a one-sided CUSUM statistic. CPD is model-agnostic, training-free, and runs online. On a benchmark of 1,012 optimization-based suffix attacks and 1,012 benign prompts, CPD improved F1 over windowed-perplexity baselines on all six tested models and reached an AUROC of 0.88 on LLaMA-2-7B at k=0.
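The pipeline described above might look roughly like the following Python sketch. It is not the authors' implementation. The source states only that a robust baseline, standardized entropies, and a one-sided CUSUM are used; the median/MAD baseline, the `drift` and `threshold` values, the choice to test for an upward shift, and all function names here are assumptions made for illustration.

```python
import numpy as np

def token_entropies(logits):
    """Shannon entropy (in nats) of the next-token distribution at each position.

    logits: array of shape (num_tokens, vocab_size).
    """
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize the softmax
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def robust_baseline(system_entropies):
    """Median and MAD-based scale of the system-prompt entropies.

    Assumption: the source does not say which robust estimator is used.
    """
    h = np.asarray(system_entropies, dtype=float)
    med = float(np.median(h))
    mad = float(np.median(np.abs(h - med)))
    scale = max(1.4826 * mad, 1e-6)                   # avoid division by zero
    return med, scale

def cusum_detect(user_entropies, med, scale, drift=0.5, threshold=5.0):
    """One-sided (upward) CUSUM over standardized user-token entropies.

    drift and threshold are hypothetical values, not taken from the source.
    Returns (index of the first alarm or None, maximum CUSUM statistic).
    """
    s, s_max, alarm = 0.0, 0.0, None
    for t, h in enumerate(user_entropies):
        z = (h - med) / scale                         # standardize against baseline
        s = max(0.0, s + z - drift)                   # accumulate positive evidence only
        s_max = max(s_max, s)
        if alarm is None and s > threshold:
            alarm = t                                 # first token where the alarm fires
    return alarm, s_max

if __name__ == "__main__":
    # Made-up entropy values, for illustration only.
    med, scale = robust_baseline([1.0, 1.2, 0.8, 1.1, 0.9])
    print(cusum_detect([1.0, 1.0, 3.0, 3.0, 3.0], med, scale))
```

Because the statistic is updated one token at a time and needs only the running sum, a detector of this shape can run alongside decoding without a separate training step, which is consistent with the "training-free" and "online" properties reported for CPD.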

Key facts

CPD Online detects fluent optimization-based adversarial suffixes in LLMs.
Detection is cast as an online change-point problem over token-level next-token entropy.
The detector uses the LLM system prompt to estimate a robust baseline.
It applies a one-sided CUSUM statistic on standardized user-token entropies.
CPD is model-agnostic, training-free, and runs online.
The benchmark comprises 1,012 attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 benign prompts.
CPD improves F1 over windowed-perplexity baselines on all six tested models.
On LLaMA-2-7B at k=0, CPD achieves AUROC 0.88.
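For context on the two metrics cited above, here is a minimal Python sketch of how AUROC and F1 can be computed from per-prompt detector scores. It assumes each prompt is summarized by a single score (for example, the maximum CUSUM statistic); the source does not specify the scoring or thresholding protocol, and the function names and example numbers are hypothetical.

```python
import numpy as np

def auroc(attack_scores, benign_scores):
    """Probability that a random attack prompt outscores a random benign one.

    Ties count as half, which matches the rank-based (Mann-Whitney) definition.
    """
    a = np.asarray(attack_scores, dtype=float)[:, None]
    b = np.asarray(benign_scores, dtype=float)[None, :]
    wins = (a > b).sum() + 0.5 * (a == b).sum()
    return float(wins / (a.size * b.size))

def f1_at_threshold(attack_scores, benign_scores, threshold):
    """F1 when every prompt scoring above `threshold` is flagged as an attack."""
    flagged_attacks = np.asarray(attack_scores, dtype=float) > threshold
    flagged_benign = np.asarray(benign_scores, dtype=float) > threshold
    tp = int(flagged_attacks.sum())        # attacks correctly flagged
    fn = int((~flagged_attacks).sum())     # attacks missed
    fp = int(flagged_benign.sum())         # benign prompts wrongly flagged
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

if __name__ == "__main__":
    # Made-up scores, for illustration only.
    print(auroc([3.0, 4.0, 0.5], [1.0, 2.0]))
    print(f1_at_threshold([3.0, 4.0, 0.5], [1.0, 2.0], threshold=2.5))
```

AUROC summarizes ranking quality across all thresholds, while F1 depends on one chosen operating point, which is why a brief such as this one can report both.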

Sources

arXiv cs.AI — 2026-05-20