WARDEN: Distributionally Robust Adversarial Training for LLMs
WARDEN is a new adversarial training framework designed to harden large language models (LLMs) against adversarial prompts. Despite advances in alignment and safety, LLMs remain vulnerable to novel attacks, and existing adversarial training techniques are computationally expensive and difficult to scale. Recent methods such as CAT and CAPO reduce this cost by applying gradient-based perturbations in the embedding space rather than searching over discrete tokens. WARDEN builds on this line of work by dynamically reweighting adversarial examples via an f-divergence ambiguity set around the empirical training distribution, optimizing the worst-case adversarial loss within a divergence ball. The goal is robustness training that is both more efficient and more scalable.
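In distributionally robust terms, the objective has the form min over model parameters of the maximum of the expected adversarial loss over all distributions Q with D_f(Q || P_hat) <= rho, where P_hat is the empirical training distribution. The summary does not specify which f-divergence or solver WARDEN uses; as a hedged illustration, the sketch below assumes the KL divergence, whose inner maximization admits a closed-form exponential-tilting solution. The function names, the temperature `tau` (tied to the radius rho via duality), and the detach-based weighting are illustrative assumptions, not WARDEN's actual implementation.

```python
import torch

def kl_dro_weights(losses: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Worst-case example weights for a KL-divergence ball (assumed choice).

    For KL-DRO, max over {q : KL(q || p) <= rho} of E_q[loss] is attained
    by exponential tilting q_i proportional to exp(loss_i / tau), where
    the temperature tau corresponds to the ball radius rho via duality.
    """
    # Subtract the max before exponentiating for numerical stability.
    tilted = torch.exp((losses - losses.max()) / tau)
    return tilted / tilted.sum()

def dro_adversarial_loss(per_example_adv_losses: torch.Tensor,
                         tau: float = 1.0) -> torch.Tensor:
    """Reweight per-example adversarial losses toward the worst case."""
    # Weights are computed from detached losses and treated as constants,
    # so gradients flow only through the losses themselves.
    weights = kl_dro_weights(per_example_adv_losses.detach(), tau)
    return (weights * per_example_adv_losses).sum()
```

In a training loop, `per_example_adv_losses` would come from evaluating the model on adversarially perturbed inputs. A small `tau` concentrates weight on the hardest examples (approaching the single worst case), while a large `tau` recovers the uniform average, so the ball radius interpolates between standard and worst-case training.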
Key facts
- WARDEN is a distributionally robust adversarial training framework for LLMs.
- It dynamically reweights adversarial examples using an f-divergence ambiguity set.
- The method optimizes worst-case adversarial loss within a divergence ball.
- Existing approaches like CAT and CAPO use gradient-based perturbations in embedding space (see the sketch after this list).
- LLMs remain vulnerable to adversarial prompting despite alignment and safety advances.
- Adversarial training can improve robustness but is computationally expensive.
- The framework is designed to be more efficient and scalable.
- The paper is available on arXiv with ID 2605.05415.
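The summary does not detail the exact perturbation procedures of CAT or CAPO; as a minimal sketch of the embedding-space idea these methods share, the snippet below takes a single signed-gradient ascent step on input embeddings, assuming a Hugging Face-style causal LM interface (`inputs_embeds`, a `.loss` attribute). The step size `epsilon` and the single-step choice are illustrative assumptions.

```python
import torch

def embedding_space_perturbation(model, input_embeds: torch.Tensor,
                                 labels: torch.Tensor,
                                 epsilon: float = 0.01) -> torch.Tensor:
    """One signed-gradient ascent step on input embeddings (illustrative).

    Perturbing continuous embeddings avoids discrete token search, which
    is what makes this family of methods comparatively cheap.
    """
    input_embeds = input_embeds.detach().requires_grad_(True)
    loss = model(inputs_embeds=input_embeds, labels=labels).loss
    grad, = torch.autograd.grad(loss, input_embeds)
    # Move embeddings in the direction that increases the loss.
    return (input_embeds + epsilon * grad.sign()).detach()
```

The perturbed embeddings produced this way would supply the per-example adversarial losses that a DRO reweighting scheme, such as the earlier sketch, then aggregates.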
Entities
Institutions
- arXiv