Safety-Aware Probing Framework Prevents LLM Safety Degradation During Fine-Tuning
A new study (arXiv 2505.16737) introduces a safety-aware probing (SAP) optimization framework to prevent large language models from losing their safety alignment during fine-tuning. The researchers demonstrate that the safety and task-performance loss landscapes are partially decoupled: gradient updates that improve task-specific performance can inadvertently shift the model toward unsafe regions of parameter space. SAP uses contrastive safety signals to identify safety-correlated directions in the model's representations, then optimizes a lightweight probe to maintain safety constraints throughout training. The paper revisits the fundamental question of why fine-tuning on non-harmful data can degrade safety, and offers a solution that preserves alignment without sacrificing task performance.
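To make "contrastive safety signals" concrete, here is a minimal sketch of one standard way a safety-correlated direction can be extracted: a difference-of-means probe over hidden activations for harmful versus harmless prompts. This illustrates the general technique rather than the paper's exact procedure; the Hugging Face-style model/tokenizer interface, the layer index, and the prompt lists are all assumptions.

```python
# Minimal sketch of extracting a safety-correlated direction from
# contrastive prompts via a difference-of-means probe. The model/
# tokenizer interface, layer index, and prompt lists are illustrative
# assumptions, not the paper's implementation.
import torch

@torch.no_grad()
def safety_direction(model, tokenizer, harmful_prompts, harmless_prompts,
                     layer: int = -1) -> torch.Tensor:
    """Unit-norm direction separating harmful from harmless activations."""
    def mean_activation(prompts):
        acts = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            out = model(ids, output_hidden_states=True)
            # Take the last token's hidden state at the chosen layer.
            acts.append(out.hidden_states[layer][0, -1])
        return torch.stack(acts).mean(dim=0)

    direction = (mean_activation(harmful_prompts)
                 - mean_activation(harmless_prompts))
    return direction / direction.norm()  # safety-correlated axis
```

A unit-norm direction like this is the kind of object a lightweight probe can then monitor, or regularize against, during fine-tuning.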
Key facts
- arXiv paper 2505.16737 introduces SAP framework
- Safety and task-performance loss landscapes are partially decoupled
- Fine-tuning on non-harmful data can still compromise safety
- SAP uses contrastive safety signals to locate safety-correlated directions
- A lightweight probe is optimized to maintain safety during fine-tuning (a training-step sketch follows this list)
- The study addresses safety degradation caused by both adversarial and benign fine-tuning data
- SAP aims to preserve safety alignment without harming task performance
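As a rough illustration of how a probe-derived direction might be folded into fine-tuning, the sketch below adds a quadratic penalty on activation movement along the safety axis to the ordinary task loss. The penalty form, the weight `lam`, and the function name `safe_finetune_step` are hypothetical; the paper's actual SAP objective may differ.

```python
# Illustrative training step that combines the task loss with a
# probe-based safety penalty. The penalty form, the weight `lam`,
# and the layer choice are assumptions made for this sketch; the
# paper's actual SAP objective may differ.
import torch

def safe_finetune_step(model, batch, safety_dir, optimizer,
                       layer: int = -1, lam: float = 0.1):
    """One update step; `batch` holds input_ids/attention_mask/labels,
    `safety_dir` is a unit vector on the model's device and dtype."""
    out = model(**batch, output_hidden_states=True)
    task_loss = out.loss  # standard next-token fine-tuning loss

    # Penalize the projection of last-token activations onto the
    # safety-correlated direction, discouraging drift along that axis.
    h = out.hidden_states[layer][:, -1, :]   # (batch, d_model)
    proj = h @ safety_dir                    # (batch,) scalar projections
    safety_penalty = proj.pow(2).mean()

    loss = task_loss + lam * safety_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task_loss.item(), safety_penalty.item()
```

In practice one would call `safe_finetune_step` once per batch, watching both returned values to check that the task loss falls while the safety penalty stays flat.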