Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

ai-technology · 2026-05-20

A new arXiv preprint (2605.18104) reports that multimodal large language models (MLLMs) fail to transfer safety capabilities from text to non-text inputs, a failure the authors term Safety Geometry Collapse. The researchers analyzed two directions: a text-aligned refusal direction and a modality-induced drift direction. They show that multimodal inputs compress the usable separation along the refusal direction, which makes that direction unreliable for identifying harmful inputs. They quantified the effect with a measure called conditional refusal separability and found that stronger drift is associated with weaker separability and higher attack success rates. A fixed-strength activation intervention that counteracts the estimated drift restored refusal separability, which points to a possible correction method.

Key facts

Multimodal LLMs fail to transfer safety capabilities from text to non-text inputs.
The failure is termed Safety Geometry Collapse.
A text-aligned refusal direction and modality-induced drift direction were analyzed.
Multimodal inputs compress usable separation along the refusal direction.
Stronger modality-induced drift is associated with weaker refusal separability.
Higher attack success rates correlate with stronger drift.
A fixed-strength activation intervention counteracting drift restored refusal separability.
The study is arXiv preprint 2605.18104.
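
The mechanism summarized above can be illustrated with a small toy simulation. The sketch below is not the paper's code, data, or exact definitions: it uses synthetic 8-dimensional vectors in place of model activations, estimates the refusal direction as a difference of class means (a common choice, assumed here), estimates drift as the shift between the multimodal and text means, and uses the fraction of harmful inputs that cross a text-calibrated threshold as a rough stand-in for refusal separability. All names and constants (`DRIFT_STRENGTH`, `ALPHA`, `flagged_fraction`, and so on) are hypothetical.

```python
import math
import random

# Toy illustration on synthetic vectors; all names and numbers are hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def diff(a, b):
    return [x - y for x, y in zip(a, b)]

def shift(v, direction, scale):
    """Return v + scale * direction."""
    return [x + scale * d for x, d in zip(v, direction)]

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def sample(rng, center, n, noise=0.3):
    """Draw n Gaussian points around center (stand-ins for activations)."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]

rng = random.Random(0)
DIM = 8
origin = [0.0] * DIM
refusal_axis = [1.0] + [0.0] * (DIM - 1)             # assumed true refusal axis
drift_axis = unit([-1.0, 1.0] + [0.0] * (DIM - 2))   # assumed drift, partly opposing it
DRIFT_STRENGTH = 2.0                                 # simulated modality shift

harmful_center = shift(origin, refusal_axis, 1.0)
harmless_center = shift(origin, refusal_axis, -1.0)

# Text inputs: the two classes are well separated along the refusal axis.
text_harmful = sample(rng, harmful_center, 200)
text_harmless = sample(rng, harmless_center, 200)

# Multimodal inputs: same classes, but every point is displaced by the drift.
mm_harmful = [shift(v, drift_axis, DRIFT_STRENGTH) for v in sample(rng, harmful_center, 200)]
mm_harmless = [shift(v, drift_axis, DRIFT_STRENGTH) for v in sample(rng, harmless_center, 200)]

# Refusal direction and decision threshold, both estimated from text only.
mu_harmful, mu_harmless = mean_vec(text_harmful), mean_vec(text_harmless)
refusal_dir = unit(diff(mu_harmful, mu_harmless))
threshold = 0.5 * (dot(mu_harmful, refusal_dir) + dot(mu_harmless, refusal_dir))

# Drift direction: shift of the multimodal mean relative to the text mean.
drift_dir = unit(diff(mean_vec(mm_harmful + mm_harmless),
                      mean_vec(text_harmful + text_harmless)))

def flagged_fraction(rows):
    """Share of rows whose projection on refusal_dir exceeds the text threshold."""
    return sum(dot(v, refusal_dir) > threshold for v in rows) / len(rows)

# Fixed-strength intervention: subtract the same ALPHA * drift_dir from every input.
ALPHA = 2.0
corrected_harmful = [shift(v, drift_dir, -ALPHA) for v in mm_harmful]
corrected_harmless = [shift(v, drift_dir, -ALPHA) for v in mm_harmless]

text_rate = flagged_fraction(text_harmful)
before = flagged_fraction(mm_harmful)
after = flagged_fraction(corrected_harmful)
false_pos_after = flagged_fraction(corrected_harmless)

print(f"harmful text inputs flagged:          {text_rate:.2f}")
print(f"harmful multimodal inputs flagged:    {before:.2f}")
print(f"after fixed-strength correction:      {after:.2f}")
print(f"harmless flagged after correction:    {false_pos_after:.2f}")
```

In this toy setup the drift pushes harmful multimodal inputs below the threshold calibrated on text, so most of them go unflagged until the drift is subtracted. `ALPHA` is set equal to the simulated drift strength purely for convenience; how the intervention strength is chosen for a real model is not stated in the summary above.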
