Covert Control Attacks on LLMs via Data Poisoning

ai-technology · 2026-05-27

Covert control attacks are a new data poisoning method that teaches large language models an information hiding scheme through semantic associations, enabling stealthy encoding and decoding of arbitrary malicious instructions. In evaluations across 5 LLMs, 3 backdoor defenses, and 4 prompt injection defenses, the approach outperforms heuristic-based prompt injection attacks while poisoning only a small fraction of the data.

Key facts

The proposed method teaches LLMs an information hiding scheme via semantic associations.
The attack can encode and decode arbitrary malicious instructions.
It was evaluated across 5 LLMs, 3 backdoor defenses, and 4 prompt injection defenses.
It outperforms heuristic-based prompt injection attacks with only a small poisoned fraction.
Existing defenses like outlier detection, clean-data regularization, or online monitoring can neutralize fixed trigger phrases.
Covert control attacks reveal a new, subtle poisoning-induced vulnerability.
LLMs are often fine-tuned on uncurated text datasets that adversaries can poison.
The hiding scheme relies on shared knowledge such as facts or concepts.
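
To illustrate why fixed trigger phrases are comparatively easy to neutralize, here is a minimal sketch of a surface-level outlier check. It is not one of the defenses evaluated in the work: the function names, thresholds, and toy data (including the trigger string "zx qv") are all hypothetical, and the check simply flags word n-grams that recur across untrusted samples but never appear in a trusted reference set.

```python
from collections import Counter


def ngrams(text, n):
    """Return the set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_repeated_ngrams(untrusted, clean, n=2, min_docs=3):
    """Flag n-grams seen in at least `min_docs` untrusted samples and in no clean sample.

    Hypothetical helper for illustration only; the thresholds are arbitrary.
    """
    clean_seen = set()
    for text in clean:
        clean_seen |= ngrams(text, n)

    doc_counts = Counter()
    for text in untrusted:
        # ngrams() returns a set, so each sample contributes at most once per n-gram.
        doc_counts.update(ngrams(text, n))

    return sorted(
        gram for gram, count in doc_counts.items()
        if count >= min_docs and gram not in clean_seen
    )


# Toy data (assumed): "zx qv" plays the role of a fixed trigger phrase.
clean = [
    "summarize the quarterly report",
    "translate this sentence into french",
    "write a short poem about autumn",
]
untrusted = [
    "summarize the quarterly report zx qv",
    "translate this sentence zx qv into french",
    "zx qv write a short poem about autumn",
    "explain how photosynthesis works",
]

flagged = flag_repeated_ngrams(untrusted, clean)
print(flagged)
```

A check like this depends on the same token sequence recurring across poisoned samples. An attack that hides instructions through semantic associations over shared knowledge, as described above, would present no such repeated surface string for it to latch onto.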
