Adaptive Instruction Composition Boosts LLM Jailbreak Diversity

ai-technology · 2026-04-25

A new framework called Adaptive Instruction Composition (AIC) improves automated red-teaming of large language models by combining crowdsourced harmful queries and tactics adaptively rather than randomly. The approach uses reinforcement learning to balance exploration and exploitation in a combinatorial instruction space, guiding an attacker LLM to generate diverse jailbreaks tailored to target vulnerabilities. Experiments show AIC substantially outperforms random combinations on effectiveness and diversity metrics. The work was published on arXiv (2604.21159) and represents a step toward more robust LLM safety evaluation.

Key facts

Adaptive Instruction Composition (AIC) is a novel framework for LLM red-teaming.
It adaptively combines crowdsourced harmful queries and tactics using reinforcement learning.
The method jointly optimizes effectiveness and diversity of jailbreaks.
In experiments, it substantially outperforms random combination on both effectiveness and diversity metrics.
The paper is on arXiv with ID 2604.21159.
The approach addresses limitations of prior trial-and-error and random combination methods.
Reinforcement learning balances exploration and exploitation in instruction space.
The attacker LLM is guided toward diverse generations tailored to target vulnerabilities.
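
The exploration/exploitation trade-off over a combinatorial space of (query, tactic) pairs can be illustrated with a toy multi-armed bandit. This is only a sketch under stated assumptions: the summary does not say which reinforcement learning algorithm AIC uses, so UCB1 is swapped in here as a simple stand-in, and the function names, the placeholder IDs (`q0`, `t1`, ...) and the simulated reward are all hypothetical. No attacker or target model is involved; the reward is a coin flip with a hidden bias.

```python
import math
import random
from itertools import product


def ucb1_select(counts, values, total_pulls, c=1.4):
    """Pick the arm with the highest UCB1 score; untried arms go first."""
    for arm, n in counts.items():
        if n == 0:
            return arm  # explore every combination at least once
    return max(
        counts,
        key=lambda a: values[a] + c * math.sqrt(math.log(total_pulls) / counts[a]),
    )


def run_bandit(queries, tactics, reward_fn, rounds=500, seed=0):
    """Adaptively choose (query, tactic) pairs instead of sampling them at random."""
    rng = random.Random(seed)
    arms = list(product(queries, tactics))  # the combinatorial instruction space
    counts = {arm: 0 for arm in arms}
    values = {arm: 0.0 for arm in arms}  # running mean reward per arm
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, values, t)
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values


def simulated_reward(arm, rng):
    """Hypothetical stand-in for a success signal: one pair is secretly better."""
    p = 0.8 if arm == ("q1", "t2") else 0.2
    return 1.0 if rng.random() < p else 0.0


counts, values = run_bandit(["q0", "q1"], ["t0", "t1", "t2"], simulated_reward)
```

The confidence bonus in `ucb1_select` shrinks as a pair is tried more often, so under-sampled combinations keep getting revisited while high-reward ones are selected more frequently. A random-combination baseline would replace `ucb1_select` with a uniform draw from `arms`.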
