SafeMERGE: Selective Layer Merging Restores LLM Safety After Fine-Tuning
Researchers have introduced SafeMERGE, a lightweight post-fine-tuning framework that restores safety alignment in large language models (LLMs) while preserving task performance. Fine-tuning a generalist LLM for a downstream task often erodes its ability to refuse harmful prompts, and existing realignment methods tend to be either costly to apply or damaging to task utility. SafeMERGE selectively merges layers from a safety-aligned model into the fine-tuned one, but only for layers that deviate from safe behavior according to a cosine-similarity criterion. Evaluated across four LLMs and multiple tasks, SafeMERGE consistently reduces harmful outputs compared with other defenses, with negligible (and sometimes even positive) impact on task utility, making it a simple and effective safeguard for fine-tuned LLMs.
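The summary does not spell out exactly which quantity the cosine similarity is computed over, so the sketch below is one plausible reading rather than the paper's algorithm: it compares each layer's fine-tuning update direction against the corresponding safety-alignment update direction (both relative to a shared base model) and interpolates toward the safety-aligned weights only for layers whose similarity falls below a threshold. The function name `safemerge_sketch`, the threshold `tau`, and the merge coefficient `alpha` are all illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F


def safemerge_sketch(finetuned_sd, safe_sd, base_sd, tau=0.9, alpha=0.5):
    """Selectively merge safety-aligned layers into a fine-tuned model.

    For each weight tensor, compare the fine-tuning update direction with
    the safety-alignment update direction (both taken relative to the shared
    base model) via cosine similarity. Layers whose similarity falls below
    ``tau`` are treated as having drifted from safe behavior and are
    interpolated toward the safety-aligned weights with coefficient
    ``alpha``. Both hyperparameters are illustrative assumptions.
    """
    merged = {}
    for name, w_ft in finetuned_sd.items():
        d_ft = (w_ft - base_sd[name]).flatten()             # fine-tuning update
        d_safe = (safe_sd[name] - base_sd[name]).flatten()  # safety update
        sim = F.cosine_similarity(d_ft, d_safe, dim=0)
        if sim < tau:
            # Deviating layer: blend the safety-aligned weights back in.
            merged[name] = alpha * safe_sd[name] + (1 - alpha) * w_ft
        else:
            # Layer still consistent with safe behavior: keep it as-is.
            merged[name] = w_ft
    return merged


if __name__ == "__main__":
    # Toy demo with random "state dicts" standing in for real model weights.
    torch.manual_seed(0)
    base = {"layer0": torch.randn(4, 4), "layer1": torch.randn(4, 4)}
    safe = {k: v + 0.1 * torch.randn_like(v) for k, v in base.items()}
    tuned = {k: v + 0.5 * torch.randn_like(v) for k, v in base.items()}
    out = safemerge_sketch(tuned, safe, base, tau=0.3, alpha=0.5)
    print({k: tuple(v.shape) for k, v in out.items()})
```

Because untouched layers keep their fine-tuned weights verbatim, this kind of selective merge disturbs task performance far less than interpolating the entire model, which is the intuition behind SafeMERGE's "negligible utility cost" claim.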
Key facts
- SafeMERGE is a post-fine-tuning framework for LLMs
- It restores safety alignment eroded by fine-tuning
- Selectively merges layers from a safety-aligned model
- Uses a cosine-similarity criterion to detect deviation from safe behavior
- Tested on four LLMs and multiple tasks
- Reduces harmful outputs compared to other defenses
- Negligible or positive impact on task utility
- Lightweight and easy to implement