ARTFEED — Contemporary Art Intelligence

SafeSteer: Decoding-Level Defense for Multimodal LLMs

ai-technology · 2026-05-13

Researchers have introduced SafeSteer, a decoding-level defense for multimodal large language models (MLLMs) that counters jailbreak attacks without costly fine-tuning. The method uses a lightweight Decoding-Probe to detect harmful outputs as they are generated and iteratively steers decoding back toward safe responses. The work, published on arXiv (2605.11716), observes that MLLMs can already distinguish harmful from harmless inputs at the decoding stage and that image-based attacks are stealthier than text-based ones. SafeSteer is positioned against current defenses, which rely on expensive fine-tuning or inefficient post-hoc interventions and often involve performance trade-offs.
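
To make the probe-and-steer idea concrete, the toy Python sketch below illustrates the general pattern of a decoding-level defense: a lightweight probe scores each decoding step, and flagged steps have their next-token distribution nudged toward a safe direction. Every component here (the linear probe, the safety bias vector, the threshold, and all names) is a hypothetical stand-in for illustration, not SafeSteer's actual implementation.

```python
import numpy as np

# Illustrative sketch only: the probe, steering vector, and threshold
# below are hypothetical stand-ins, not the paper's actual components.

rng = np.random.default_rng(0)

VOCAB = 1000          # toy vocabulary size
HIDDEN = 64           # toy hidden-state width
THRESHOLD = 0.5       # hypothetical harmfulness cutoff

# A lightweight "decoding probe": here just a linear scorer over the
# decoder's hidden state, squashed to a [0, 1] harmfulness score.
probe_w = rng.normal(size=HIDDEN)

def probe_score(hidden_state: np.ndarray) -> float:
    return 1.0 / (1.0 + np.exp(-hidden_state @ probe_w))

# A fixed "safety direction" in logit space; steering mixes it into the
# next-token distribution whenever the probe flags the current step.
safety_logit_bias = rng.normal(size=VOCAB)

def decode_step(hidden_state: np.ndarray, logits: np.ndarray, alpha: float = 2.0) -> int:
    """One decoding step: probe the state, steer the logits if flagged, sample a token."""
    if probe_score(hidden_state) > THRESHOLD:
        # Iterative correction: push the distribution toward the safe direction.
        logits = logits + alpha * safety_logit_bias
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(VOCAB, p=probs))

# Toy decoding loop over a few steps with random states and logits.
generated = [decode_step(rng.normal(size=HIDDEN), rng.normal(size=VOCAB)) for _ in range(5)]
print(generated)
```

The point of the sketch is only that intervention happens per decoding step rather than through fine-tuning or a post-hoc filter over the finished output.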

Key facts

  • SafeSteer is a decoding-level defense mechanism for MLLMs.
  • It uses a lightweight Decoding-Probe to detect and correct harmful outputs.
  • The method iteratively steers decoding toward safety.
  • Current defenses rely on costly fine-tuning or inefficient post-hoc interventions.
  • MLLMs can distinguish harmful and harmless inputs during decoding.
  • Image-based attacks are stealthier than text-based ones.
  • The research is published on arXiv with ID 2605.11716.
  • SafeSteer aims to handle novel attacks without the performance trade-offs of existing defenses.

Entities

Institutions

  • arXiv

Sources