Attention-Guided RL Jailbreak Method Targets Large Reasoning Models

ai-technology · 2026-05-20

A new study reports that Large Reasoning Models (LRMs) are more vulnerable to jailbreak attacks than standard LLMs, which it attributes to their exposed internal reasoning. The researchers found that attack success correlates with the model's attention patterns: in successful jailbreaks, harmful tokens receive less attention in the input prompt and more attention in the reasoning content. Building on this finding, they propose a reinforcement learning-based jailbreak method that incorporates attention signals into the reward function to enhance attack effectiveness.

Key facts

LRMs generate structured, step-by-step reasoning content.
LRMs are more vulnerable to jailbreak attacks than standard LLMs.
Attack success rate correlates with LRMs' attention patterns.
In successful jailbreaks, harmful tokens in the input prompt receive lower attention.
In successful jailbreaks, harmful tokens in the reasoning content receive higher attention.
The proposed method uses reinforcement learning with an attention-guided reward.
The method introduces diverse prompts for attack generation.
Study published on arXiv with ID 2605.19485.
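
The attention pattern described in the key facts can be made concrete with a small diagnostic. The sketch below is illustrative only and is not taken from the study: the function names, the definition of "attention share", and the additive reward shape with its `weight` parameter are all assumptions. It computes what fraction of attention mass falls on flagged token positions in the prompt and in the reasoning content, then takes the difference.

```python
from typing import Sequence


def attention_share(weights: Sequence[float], flagged: Sequence[int]) -> float:
    """Fraction of total attention mass that falls on the flagged positions.

    `weights` holds one non-negative attention weight per token; `flagged`
    lists the indices of tokens marked as harmful (hypothetical labeling).
    """
    total = sum(weights)
    if total <= 0:
        return 0.0
    # Use a set so a position listed twice is only counted once.
    return sum(weights[i] for i in set(flagged)) / total


def attention_gap(
    prompt_weights: Sequence[float],
    prompt_flagged: Sequence[int],
    reasoning_weights: Sequence[float],
    reasoning_flagged: Sequence[int],
) -> float:
    """Reasoning-side share minus prompt-side share.

    A larger value matches the pattern the study associates with successful
    attacks: low attention on flagged prompt tokens, high attention on
    flagged reasoning tokens.
    """
    return attention_share(reasoning_weights, reasoning_flagged) - attention_share(
        prompt_weights, prompt_flagged
    )


def shaped_reward(success: bool, gap: float, weight: float = 0.5) -> float:
    """Hypothetical reward shape: a task outcome term plus a weighted attention term."""
    return float(success) + weight * gap


# Toy numbers, chosen only to make the arithmetic easy to follow.
gap = attention_gap([1, 1, 2, 4], [0], [2, 1, 1], [0])
print(gap)  # 0.5 - 0.125 = 0.375
```

The same gap could equally serve as a monitoring signal on the defensive side, since it only requires attention weights and a set of flagged positions.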
