Adversarial Prompt Disentanglement Framework for LLM Security
Adversarial Prompt Disentanglement (APD) is a proposed defense framework for protecting Large Language Models (LLMs) against adversarial prompts that exploit semantic ambiguities. Such attacks, which include jailbreaking and prompt injection, bypass safety mechanisms and elicit harmful outputs. Rather than filtering outputs after the fact, APD works proactively: it identifies and neutralizes malicious components of a prompt before the LLM processes it. The framework integrates three innovations: mutual information-based semantic decomposition, which isolates adversarial components from benign ones; graph-based intent classification, which uses spectral analysis to detect malicious patterns; and a lightweight transformer-based classifier. The stated goal is to improve the integrity and availability of LLMs in security-critical applications.
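The summary does not say how APD estimates mutual information, so the following is only a minimal sketch of the underlying quantity, not the framework's method. It assumes each prompt has already been reduced to two discrete labels (hypothetically, a "benign component" label and an "adversarial component" label) and computes a plug-in estimate of the mutual information between them. The function name `mutual_information` and the toy data are illustrative assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples.

    Illustrative only: a decomposition like the one described above would
    aim to drive this quantity toward zero between the two components.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x, y) pairs
    count_x = Counter(xs)          # marginal counts of x
    count_y = Counter(ys)          # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * log2(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Hypothetical labels: the first pair is independent, the second identical.
    print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
    print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
```

An estimate of zero means the two label sequences carry no information about each other in the sample; a real system operating on continuous embeddings would need a different estimator than this counting approach.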
Key facts
- APD framework proposed for LLM security
- Addresses adversarial prompts exploiting semantic ambiguities
- Attacks include jailbreaking and prompt injection
- Proactive identification and neutralization of malicious components
- Three innovations: semantic decomposition, graph-based classification, transformer classifier
- Mutual information-based decomposition is intended to make the adversarial and benign components statistically independent
- Spectral analysis used for intent classification
- Targets security-critical applications
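For readers unfamiliar with the two mathematical tools named above, the standard definitions are as follows. How APD builds its graph or which spectral quantities it uses is not stated in this summary, so the symbols here ($X$, $Y$, $A$, $D$, $L$, $\lambda_i$) are generic textbook notation, not the framework's own.

```latex
% Mutual information between discrete random variables X and Y.
% It is zero exactly when X and Y are statistically independent.
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)},
\qquad
I(X;Y) = 0 \iff p(x,y) = p(x)\,p(y)\ \text{for all } x, y.

% Graph Laplacian of an undirected graph with non-negative weights,
% adjacency matrix A, and diagonal degree matrix D_{ii} = \sum_j A_{ij}.
L = D - A,
\qquad
0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n .

% The multiplicity of the eigenvalue 0 equals the number of connected
% components; \lambda_2 is the algebraic connectivity of the graph.
```

The first identity is why a mutual information objective can serve as an independence criterion; the Laplacian spectrum is one common way that spectral analysis summarizes the structure of a graph.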