Furina Attack Exploits LLM Safety Instability Region

ai-technology · 2026-05-27

A newly described jailbreak attack, dubbed Furina, targets large language models (LLMs) and their multimodal counterparts (MLLMs). According to a study posted on arXiv (2605.26158), safety behavior in these models has an instability region in which small changes to the input make refusal decisions stochastic, which the researchers present as evidence of how complex safety alignment is. Their multi-metric diagnostic framework combines external and internal signals and shows a decoupling inside that region: output uncertainty is high while internal safety activation stays low. The findings suggest that detection-based defenses fail against such attacks as a result. Notably, Furina relies on fragmented, scene-anchored prompts and requires no model-specific optimization.

Key facts

Furina is a jailbreak attack targeting LLMs and MLLMs
Safety behavior has an instability region with stochastic refusal decisions
Multi-metric diagnostic framework uses external and internal signals
Decoupling: high output uncertainty, low internal safety activation
Detection-based defenses fail against such attacks
Furina uses fragmented, scene-anchored prompts
No model-specific optimization required
Published on arXiv with ID 2605.26158
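
To make the idea of an instability region with stochastic refusal decisions concrete, here is a minimal sketch of one possible external signal: the entropy of refuse/comply outcomes when the same prompt is sampled several times. It is an illustration only and not the paper's framework. The function names, the 0.8 threshold, and the assumption that each sampled response has already been labeled as a refusal or not are all hypothetical, and the internal safety-activation signal is not modeled.

```python
import math
from typing import Sequence


def refusal_rate(refused: Sequence[bool]) -> float:
    """Fraction of sampled responses that were refusals."""
    if not refused:
        raise ValueError("need at least one sampled response")
    return sum(refused) / len(refused)


def binary_entropy(p: float) -> float:
    """Entropy in bits of a refuse/comply decision with refusal probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # fully deterministic decision
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def in_instability_region(refused: Sequence[bool], min_entropy: float = 0.8) -> bool:
    """Hypothetical rule: flag a prompt whose refusal decision is close to a coin flip.

    min_entropy is an assumed threshold, not a value taken from the study.
    """
    return binary_entropy(refusal_rate(refused)) >= min_entropy
```

Under these assumptions, a prompt refused in 5 of 10 samples has an entropy of 1 bit and is flagged, while one refused in 9 of 10 samples falls well below the threshold and is not.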
