WARD: A New Defense Against Prompt Injection Attacks on Web Agents
Researchers have introduced WARD (Web Agent Robust Defense against Prompt Injection), a protective model aimed at safeguarding web agents from prompt injection threats. These threats take advantage of weaknesses in open web settings by inserting harmful commands into HTML or visual interfaces. Current guard models face challenges such as inadequate generalization to new domains, elevated false positive rates, latency problems, and vulnerability to changing adversarial tactics. WARD utilizes two datasets: WARD-Base, which comprises around 177,000 samples from 719 popular URLs and platforms, and WARD-PIG, tailored for prompt injection attacks aimed at the guard model. Additionally, it features A3T, an adversarial training method to bolster resilience. This research is available on arXiv under ID 2605.15030.
Key facts
- WARD stands for Web Agent Robust Defense against Prompt Injection.
- The model addresses prompt injection attacks on web agents.
- WARD-Base dataset includes around 177,000 samples from 719 URLs.
- WARD-PIG dataset targets guard model-specific attacks.
- A3T is an adversarial training technique introduced in the paper.
- Existing guard models have limited generalization and high false positive rates.
- The research is available on arXiv with ID 2605.15030.
- The paper focuses on security and efficiency of web agents.
Entities
Institutions
- arXiv