Safety Bottleneck Regularization Defends LLMs Against Harmful Fine-Tuning
Safety Bottleneck Regularization (SBR) is a new defense against Harmful Fine-tuning (HFT) attacks on Large Language Models (LLMs). Existing defenses that constrain parameters or gradients can be circumvented because of the redundancy of high-dimensional parameter spaces, which lets attackers restore harmful capabilities along optimization trajectories orthogonal to the constraints. SBR instead targets the unembedding layer, which acts as a geometric bottleneck, by anchoring the last hidden states produced by harmful queries to those of a safety-aligned reference model. Experiments show that SBR preserves safe responses even under persistent HFT.
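To make the anchoring idea concrete, here is a minimal sketch of how such a regularizer could be added to a fine-tuning loss. It assumes Hugging Face-style causal LM models exposing `output_hidden_states`; the function name `sbr_penalty`, the MSE distance, and the weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def sbr_penalty(model, ref_model, harmful_input_ids, attention_mask, weight=1.0):
    """Sketch of a safety-bottleneck-style regularizer (hypothetical helper).

    Anchors the fine-tuned model's last hidden states on harmful queries to
    those of a frozen, safety-aligned reference model, so the representations
    feeding the unembedding layer cannot drift toward harmful behavior.
    """
    # Last hidden states (pre-unembedding) of the model being fine-tuned.
    h = model(
        harmful_input_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,
    ).hidden_states[-1]

    # Same representations from the frozen safety-aligned reference model.
    with torch.no_grad():
        h_ref = ref_model(
            harmful_input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        ).hidden_states[-1]

    # Penalize drift of the bottleneck representations on harmful inputs.
    # (Padding positions are not masked out here for brevity.)
    return weight * F.mse_loss(h, h_ref)


# Usage during fine-tuning (sketch):
#   loss = task_loss + sbr_penalty(model, ref_model, harmful_ids, harmful_mask)
#   loss.backward(); optimizer.step()
```

The distance metric and the set of harmful anchor queries are design choices; the point of the sketch is only that the penalty operates on the hidden states entering the unembedding layer rather than on parameters or gradients directly.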
Key facts
- SBR targets the unembedding layer as a geometric bottleneck.
- Existing defenses are circumvented via orthogonal optimization trajectories.
- High-dimensional parameter space redundancy enables HFT attacks.
- SBR anchors harmful query hidden states to safety-aligned model states.
- Experiments confirm SBR effectiveness under persistent HFT.
- The paper is available on arXiv with ID 2605.05995.
- SBR is a regularization technique for LLM safety alignment.
- The defense does not rely on parameter or gradient constraints.
Entities
Institutions
- arXiv