SafeMERGE: Selective Layer Merging Restores LLM Safety After Fine-Tuning
Researchers have introduced SafeMERGE, a lightweight post-fine-tuning framework that restores safety alignment in large language models (LLMs) while preserving task performance. Fine-tuning a generalist LLM for a downstream task often erodes its ability to refuse harmful prompts, and existing realignment methods tend to be either costly to apply or damaging to task utility. SafeMERGE selectively merges layers from a safety-aligned model into the fine-tuned one, but only for layers that deviate from safe behavior according to a cosine-similarity criterion. Evaluated across four LLMs and multiple tasks, SafeMERGE consistently reduces harmful outputs compared with other defenses, with negligible (and sometimes even positive) impact on task utility, making it a simple and effective safeguard for fine-tuned LLMs.
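The summary does not spell out exactly which quantity the cosine similarity is computed over, so the sketch below is one plausible reading rather than the paper's algorithm: it compares each layer's fine-tuning update direction against the corresponding safety-alignment update direction (both relative to a shared base model) and interpolates toward the safety-aligned weights only for layers whose similarity falls below a threshold. The function name `safemerge_sketch`, the threshold `tau`, and the merge coefficient `alpha` are all illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F


def safemerge_sketch(finetuned_sd, safe_sd, base_sd, tau=0.9, alpha=0.5):
    """Selectively merge safety-aligned layers into a fine-tuned model.

    For each weight tensor, compare the fine-tuning update direction with
    the safety-alignment update direction (both taken relative to the shared
    base model) via cosine similarity. Layers whose similarity falls below
    ``tau`` are treated as having drifted from safe behavior and are
    interpolated toward the safety-aligned weights with coefficient
    ``alpha``. Both hyperparameters are illustrative assumptions.
    """
    merged = {}
    for name, w_ft in finetuned_sd.items():
        d_ft = (w_ft - base_sd[name]).flatten()             # fine-tuning update
        d_safe = (safe_sd[name] - base_sd[name]).flatten()  # safety update
        sim = F.cosine_similarity(d_ft, d_safe, dim=0)
        if sim < tau:
            # Deviating layer: blend the safety-aligned weights back in.
            merged[name] = alpha * safe_sd[name] + (1 - alpha) * w_ft
        else:
            # Layer still consistent with safe behavior: keep it as-is.
            merged[name] = w_ft
    return merged


if __name__ == "__main__":
    # Toy demo with random "state dicts" standing in for real model weights.
    torch.manual_seed(0)
    base = {"layer0": torch.randn(4, 4), "layer1": torch.randn(4, 4)}
    safe = {k: v + 0.1 * torch.randn_like(v) for k, v in base.items()}
    tuned = {k: v + 0.5 * torch.randn_like(v) for k, v in base.items()}
    out = safemerge_sketch(tuned, safe, base, tau=0.3, alpha=0.5)
    print({k: tuple(v.shape) for k, v in out.items()})
```

Because untouched layers keep their fine-tuned weights verbatim, this kind of selective merge disturbs task performance far less than interpolating the entire model, which is the intuition behind SafeMERGE's "negligible utility cost" claim.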
Key facts
- SafeMERGE is a post-fine-tuning framework for LLMs
- It restores safety alignment eroded by fine-tuning
- Selectively merges layers from a safety-aligned model
- Uses a cosine-similarity criterion to detect deviation from safe behavior
- Tested on four LLMs and multiple tasks
- Reduces harmful outputs compared to other defenses
- Negligible or positive impact on task utility
- Lightweight and easy to implement