ARTFEED — Contemporary Art Intelligence

Attention Redistribution Attack Bypasses LLM Safety Alignment

ai-technology · 2026-05-04

Researchers have introduced the Attention Redistribution Attack (ARA), a method that undermines safety alignment in large language models by manipulating their attention mechanisms. The technique identifies the attention heads most responsible for safety behavior and, via Gumbel-softmax optimization, crafts adversarial tokens that pull attention away from safety-relevant positions. Unlike earlier jailbreaks, which operate at the semantic level, ARA works directly on the geometry of softmax attention and needs only a handful of tokens and comparatively few optimization steps. Evaluated on LLaMA-3-8B-Instruct, Mistral-7B-Instruct-v0.1, and Gemma-2-9B-it with just 5 adversarial tokens and 500 optimization steps, it reaches a 36% attack success rate on Mistral-7B and 30% on LLaMA-3, while Gemma-2 remains largely resistant at 1%. The full paper is available on arXiv under identifier 2605.00236.
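The summary gives only the outline of the optimization, so the following is a minimal sketch rather than the authors' implementation. It assumes PyTorch and Hugging Face transformers; the targeted (layer, head) pairs, the safety-relevant positions, the loss (summed attention mass on those positions), the learning rate, the temperature, and the prompt are all placeholders. Only the ingredients stated above are taken from the article: a Gumbel-softmax relaxation over the vocabulary, 5 adversarial tokens, and 500 optimization steps.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # one of the evaluated models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="eager",   # eager attention so output_attentions works
    device_map="auto",
)
model.eval()
model.requires_grad_(False)        # only the adversarial token logits are trained

prompt_ids = tok("a placeholder request goes here", return_tensors="pt").input_ids.to(model.device)
prompt_embeds = model.get_input_embeddings()(prompt_ids)          # (1, L, hidden)
emb_matrix = model.get_input_embeddings().weight                  # (vocab, hidden)

n_adv = 5                          # 5 adversarial tokens, as reported
adv_logits = torch.zeros(n_adv, emb_matrix.shape[0], device=model.device, requires_grad=True)
opt = torch.optim.Adam([adv_logits], lr=0.1)                       # lr is a placeholder

# Placeholders: which heads are treated as safety-critical and which positions
# count as safety-relevant (here, simply every prompt token).
safety_heads = [(12, 3), (15, 7)]                                  # (layer, head) pairs
safety_positions = torch.arange(prompt_ids.shape[1], device=model.device)

for step in range(500):            # 500 optimization steps, as reported
    # Gumbel-softmax relaxation: a differentiable "soft" token choice per position.
    soft_onehot = F.gumbel_softmax(adv_logits, tau=1.0, hard=False)        # (n_adv, vocab)
    adv_embeds = (soft_onehot.to(emb_matrix.dtype) @ emb_matrix).unsqueeze(0)
    inputs_embeds = torch.cat([prompt_embeds, adv_embeds], dim=1)

    out = model(inputs_embeds=inputs_embeds, output_attentions=True)
    # Attention the final position pays to safety-relevant positions at the
    # targeted heads; driving this down "redistributes" attention away from them.
    loss = sum(out.attentions[layer][0, head, -1, safety_positions].sum()
               for layer, head in safety_heads)

    opt.zero_grad()
    loss.backward()
    opt.step()

# Discretize the relaxation: pick the most likely token at each adversarial position.
adv_token_ids = adv_logits.argmax(dim=-1)
print(tok.decode(adv_token_ids.tolist()))
```

The final argmax is the simplest possible rounding of the relaxed token choices; a more careful reproduction would anneal the temperature or use the straight-through (hard) variant before discretizing.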

Key facts

  • ARA is a white-box adversarial attack targeting safety-critical attention heads (a generic head-probing sketch follows this list).
  • It uses nonsemantic adversarial tokens to redirect attention away from safety-relevant positions.
  • The attack operates on the geometry of softmax attention using Gumbel-softmax optimization.
  • Tested on LLaMA-3-8B-Instruct, Mistral-7B-Instruct-v0.1, and Gemma-2-9B-it.
  • Achieves 36% ASR on Mistral-7B and 30% on LLaMA-3 against 200 HarmBench prompts.
  • Gemma-2-9B-it remains largely resistant, at 1% ASR.
  • Requires as few as 5 tokens and 500 optimization steps.
  • The paper is on arXiv with ID 2605.00236.
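
The summary does not say how the safety-critical heads are located. As a rough, generic diagnostic (not the paper's procedure), one can compare, head by head, how much attention the final position pays to the content tokens of a harmful versus a benign request and flag the heads that differ most; the model choice, the prompt pair, and the scoring rule below are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # one of the evaluated models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="eager",   # eager attention so output_attentions works
    device_map="auto",
)
model.eval()

def content_attention(text):
    """Per-head attention from the final position to the content tokens
    (everything after the BOS token), as a (layers, heads) tensor."""
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        out = model(ids, output_attentions=True)
    # out.attentions: one (1, heads, seq, seq) tensor per layer
    return torch.stack([a[0, :, -1, 1:].sum(-1) for a in out.attentions])

# Placeholder prompt pair; a real probe would average over many such pairs.
harmful = content_attention("Explain how to pick the lock on a neighbor's door.")
benign  = content_attention("Explain how to pick a ripe avocado at the store.")

diff = (harmful - benign).abs().float()               # (layers, heads)
n_heads = diff.shape[1]
top = torch.topk(diff.flatten(), k=5).indices
candidates = [(int(i) // n_heads, int(i) % n_heads) for i in top]
print("candidate (layer, head) pairs:", candidates)
```

A more faithful probe would also check causality, for example by ablating the flagged heads and measuring whether refusal behavior weakens.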

Entities

Institutions

  • arXiv

Sources