ARTFEED — Contemporary Art Intelligence

Furina Attack Exploits LLM Safety Instability Region

ai-technology · 2026-05-27

A newly identified jailbreak attack, dubbed Furina, targets large language models (LLMs) and their multimodal counterparts (MLLMs). According to a recent study on arXiv (2605.26158), safety behavior in these models has an instability region in which small changes to the input make refusal decisions stochastic, which shows how fragile safety alignment can be. The researchers built a multi-metric diagnostic framework that combines external and internal signals, and it reveals a decoupling inside that region: output uncertainty is high while internal safety activation stays low. The findings suggest that detection-based defenses fail against attacks that exploit this gap. Notably, Furina relies on fragmented, scene-anchored prompts and requires no model-specific optimization.

Key facts

  • Furina is a jailbreak attack targeting LLMs and MLLMs
  • Safety behavior has an instability region with stochastic refusal decisions
  • Multi-metric diagnostic framework uses external and internal signals
  • Decoupling: high output uncertainty, low internal safety activation
  • Detection-based defenses fail against such attacks
  • Furina uses fragmented, scene-anchored prompts
  • No model-specific optimization required
  • Published on arXiv with ID 2605.26158
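
The instability and decoupling points above can be illustrated with a minimal sketch. It is not the study's diagnostic framework: the function names, the thresholds, and the idea of a single scalar safety_activation score in [0, 1] are all assumptions made for illustration. It treats repeated sampling of one prompt as a series of refuse/comply decisions, measures how unpredictable they are with binary entropy, and flags the case where that uncertainty is high while the internal score stays low.

```python
import math
from typing import Sequence


def refusal_entropy(refused: Sequence[bool]) -> float:
    """Binary entropy (in bits) of the empirical refusal rate.

    `refused` holds one refuse/comply decision per sampled response
    to the same prompt. 0.0 means fully consistent behavior, 1.0 means
    a coin flip between refusing and complying.
    """
    if not refused:
        raise ValueError("need at least one sampled decision")
    p = sum(refused) / len(refused)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def is_decoupled(
    refused: Sequence[bool],
    safety_activation: float,          # hypothetical internal score in [0, 1]
    entropy_threshold: float = 0.8,    # assumed cutoff, not from the study
    activation_threshold: float = 0.3, # assumed cutoff, not from the study
) -> bool:
    """Flag high output uncertainty paired with low internal safety activation."""
    unstable = refusal_entropy(refused) >= entropy_threshold
    quiet_inside = safety_activation < activation_threshold
    return unstable and quiet_inside


if __name__ == "__main__":
    samples = [True, False, True, False, False, True]  # made-up decisions
    print(refusal_entropy(samples))
    print(is_decoupled(samples, safety_activation=0.1))
```

In this toy framing, a detector that watches only the internal score would see nothing unusual in exactly the cases where the outputs are least predictable, which is one way to read why detection-based defenses are reported to fail.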

Entities

Institutions

  • arXiv

Sources