ClawTrojan Benchmark Exposes Multi-Step Trojan Attacks on LLM Agents
ClawTrojan has been developed by researchers as a benchmark to detect multi-step trojan attacks in local agentic harnesses. These systems allow LLM agents to read and write files, utilize tools, and maintain workspace states across different sessions, transitioning from simple chatbots to functional tools. Attackers can insert a prompt injection within the output of a file or tool, which an agent might later read, store, and execute. Although each step in this multi-step process seems harmless individually, they collectively transform untrusted text into a means of persistent control. Current defenses focus on inspecting steps in isolation, successfully blocking overt harmful actions but overlooking the initial write operation that establishes the backdoor. ClawTrojan seeks to expose this vulnerability.
Key facts
- ClawTrojan is a benchmark for multi-step trojan attacks in local agentic harnesses.
- LLM agents can read/write files, call tools, and reuse workspace state across sessions.
- Attackers embed prompt injections in files or tool outputs.
- Multi-step attacks appear benign individually but collectively enable persistent control.
- Existing defenses inspect each step in isolation, missing the backdoor planting.
- The research is published on arXiv with ID 2605.31042.
- The paper is a cross-type announcement.
- The threat model involves agents storing and later executing hidden instructions.
Entities
Institutions
- arXiv