AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming LLMs
AutoRISE is an approach to automated red-teaming of large language models that optimizes attack strategies rather than individual prompts. A coding agent iteratively edits executable attack programs, guided by a scalar objective and per-example diagnostics returned by a fixed evaluation harness. Because the search operates on programs, it can make structural changes, such as adding new attack components or altering control flow, that prompt-level techniques cannot express. The method was evaluated on 11 models from five families, using seven established jailbreak datasets and two new benchmark suites. On held-out models, AutoRISE improved the average attack success rate by 17.0 points over the strongest baseline.
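The search loop described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the names `Evaluation`, `evolve`, `toy_harness`, and `toy_agent` are hypothetical, the greedy keep-if-better acceptance rule is a stand-in for whatever the paper actually uses, and the "strategy" is a harmless tuple of component labels rather than a real attack program.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")  # opaque strategy type (hypothetical; the source defines no API)


@dataclass
class Evaluation:
    score: float            # scalar objective returned by the harness
    diagnostics: List[str]  # per-example notes the agent can read


def evolve(
    initial: S,
    propose_edit: Callable[[S, Evaluation], S],
    evaluate: Callable[[S], Evaluation],
    iterations: int,
) -> Tuple[S, List[float]]:
    """Greedy loop (an assumption): keep an edit only if the score improves."""
    best, best_eval = initial, evaluate(initial)
    history = [best_eval.score]
    for _ in range(iterations):
        candidate = propose_edit(best, best_eval)  # agent edits the strategy
        cand_eval = evaluate(candidate)            # fixed harness scores it
        if cand_eval.score > best_eval.score:
            best, best_eval = candidate, cand_eval
        history.append(best_eval.score)
    return best, history


# Toy stand-ins so the loop runs end to end; none of this comes from the paper.
REQUIRED = ("a", "b", "c")  # abstract component labels


def toy_harness(strategy: Tuple[str, ...]) -> Evaluation:
    missing = [c for c in REQUIRED if c not in strategy]
    score = 1 - len(missing) / len(REQUIRED)
    return Evaluation(score, [f"missing:{c}" for c in missing])


def toy_agent(strategy: Tuple[str, ...], feedback: Evaluation) -> Tuple[str, ...]:
    # Structural edit: append the first component the diagnostics report missing.
    if not feedback.diagnostics:
        return strategy
    component = feedback.diagnostics[0].split(":", 1)[1]
    return strategy + (component,)


best, history = evolve((), toy_agent, toy_harness, iterations=5)
```

In this toy run the agent reads the diagnostics, adds one component per iteration, and the loop stops accepting edits once the score can no longer improve. The harness itself never changes between iterations, which mirrors the fixed evaluation harness in the description.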
Key facts
- AutoRISE searches over executable attack programs instead of individual prompts.
- A coding agent edits a strategy at each iteration.
- A fixed evaluation harness scores attacks and provides diagnostics.
- Structural changes include new attack components and altered control flow.
- Evaluated on 11 models from five families.
- Used seven established jailbreak datasets.
- Two benchmark suites developed on disjoint target sets.
- Improves average attack success rate by 17.0 points over the strongest baseline on held-out models.
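To make the headline metric concrete, here is a small sketch of how an average attack success rate and a gain in points could be computed. It assumes a macro-average over models (the summary does not say how the average is taken), the function names are hypothetical, and the outcome lists are made-up placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, List


def asr(outcomes: List[bool]) -> float:
    """Attack success rate for one model, in percent."""
    return 100.0 * sum(outcomes) / len(outcomes)


def average_asr(per_model: Dict[str, List[bool]]) -> float:
    """Macro-average over models (an assumption about the averaging scheme)."""
    return mean(asr(outcomes) for outcomes in per_model.values())


def point_gain(method: Dict[str, List[bool]], baseline: Dict[str, List[bool]]) -> float:
    """Difference in average ASR, in percentage points."""
    return average_asr(method) - average_asr(baseline)


# Placeholder outcomes for two hypothetical held-out models (True = success).
method_outcomes = {
    "model_x": [True, True, True, False],
    "model_y": [True, False, True, False],
}
baseline_outcomes = {
    "model_x": [True, False, False, False],
    "model_y": [True, True, False, False],
}

gain = point_gain(method_outcomes, baseline_outcomes)
```

A gain reported "in points" is the absolute difference between two percentages, not a relative improvement, which is why the sketch subtracts the averages rather than dividing them.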
Entities
Institutions
- arXiv