Safety Bottleneck Regularization Defends LLMs Against Harmful Fine-Tuning
Safety Bottleneck Regularization (SBR) is a new defense against Harmful Fine-tuning (HFT) attacks on Large Language Models (LLMs). Existing defenses that constrain parameters or gradients can be circumvented because of the redundancy of high-dimensional parameter spaces, which lets attackers restore harmful capabilities along optimization trajectories orthogonal to the constraints. SBR instead targets the unembedding layer, which acts as a geometric bottleneck, by anchoring the last hidden states produced by harmful queries to those of a safety-aligned reference model. Experiments show that SBR preserves safe responses even under persistent HFT.
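To make the anchoring idea concrete, here is a minimal sketch of how such a regularizer could be added to a fine-tuning loss. It assumes Hugging Face-style causal LM models exposing `output_hidden_states`; the function name `sbr_penalty`, the MSE distance, and the weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def sbr_penalty(model, ref_model, harmful_input_ids, attention_mask, weight=1.0):
    """Sketch of a safety-bottleneck-style regularizer (hypothetical helper).

    Anchors the fine-tuned model's last hidden states on harmful queries to
    those of a frozen, safety-aligned reference model, so the representations
    feeding the unembedding layer cannot drift toward harmful behavior.
    """
    # Last hidden states (pre-unembedding) of the model being fine-tuned.
    h = model(
        harmful_input_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,
    ).hidden_states[-1]

    # Same representations from the frozen safety-aligned reference model.
    with torch.no_grad():
        h_ref = ref_model(
            harmful_input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        ).hidden_states[-1]

    # Penalize drift of the bottleneck representations on harmful inputs.
    # (Padding positions are not masked out here for brevity.)
    return weight * F.mse_loss(h, h_ref)


# Usage during fine-tuning (sketch):
#   loss = task_loss + sbr_penalty(model, ref_model, harmful_ids, harmful_mask)
#   loss.backward(); optimizer.step()
```

The distance metric and the set of harmful anchor queries are design choices; the point of the sketch is only that the penalty operates on the hidden states entering the unembedding layer rather than on parameters or gradients directly.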
Key facts
- SBR targets the unembedding layer as a geometric bottleneck.
- Existing defenses are circumvented via orthogonal optimization trajectories.
- High-dimensional parameter space redundancy enables HFT attacks.
- SBR anchors harmful query hidden states to safety-aligned model states.
- Experiments confirm SBR effectiveness under persistent HFT.
- The paper is available on arXiv with ID 2605.05995.
- SBR is a regularization technique for LLM safety alignment.
- The defense does not rely on parameter or gradient constraints.
Entities
Institutions
- arXiv