SQSD Method Quantifies Sample-Level Safety Degradation in LLM Fine-Tuning
A new study (arXiv:2605.04572) reveals that fine-tuning large language models (LLMs) on benign samples can cause cumulative parameter drift toward danger-aligned directions, progressively eroding safety behaviors learned from millions of preference examples. The authors propose Sample-Level Quantification of Safety Degradation (SQSD), a method that assigns each training sample a continuous risk score by analyzing parameter dynamics during fine-tuning. This identifies which samples contribute most to safety degradation, enabling targeted mitigation. The research highlights the fragility of safety alignment and provides a granular tool for assessing fine-tuning risks at the sample level.
Key facts
- Safety alignment of LLMs is fragile; fine-tuning on benign samples can erase safety behaviors.
- Existing studies compare parameters only before and after fine-tuning, ignoring how parameters evolve during training.
- Benign fine-tuning causes cumulative parameter drift toward danger-aligned directions.
- Samples contributing more to drift pose greater fine-tuning risks.
- SQSD quantifies influence of each training sample on safety degradation.
- Method computes continuous risk scores for individual samples.
- Study published on arXiv with ID 2605.04572.
- Research provides granular tool for assessing fine-tuning risks.
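The drift-scoring idea above can be sketched in code. The paper's exact formulation is not reproduced here; this is an illustrative sketch that assumes each sample's risk is measured as the cosine alignment between its attributed parameter update and a hypothetical danger-aligned direction (the function name and the clipping to [0, 1] are assumptions for illustration):

```python
import numpy as np

def sample_risk_scores(per_sample_updates, danger_direction):
    """Score each training sample by how strongly its attributed
    parameter update drifts toward a danger-aligned direction.

    per_sample_updates: (n_samples, n_params) array of parameter
        deltas attributed to each training sample.
    danger_direction: (n_params,) vector pointing toward
        safety-degrading parameter configurations (assumed given).
    """
    # Normalize the danger direction to a unit vector.
    d = danger_direction / np.linalg.norm(danger_direction)
    norms = np.linalg.norm(per_sample_updates, axis=1)
    # Cosine similarity between each sample's update and the danger direction.
    cosines = per_sample_updates @ d / np.maximum(norms, 1e-12)
    # Continuous risk score in [0, 1]: clip away safety-preserving drift.
    return np.clip(cosines, 0.0, 1.0)

# Toy demo: 3 samples in a 4-parameter model.
updates = np.array([
    [1.0, 0.0, 0.0, 0.0],   # drifts fully along the danger direction
    [0.0, 1.0, 0.0, 0.0],   # orthogonal drift
    [-1.0, 0.0, 0.0, 0.0],  # drifts away from danger
])
danger = np.array([1.0, 0.0, 0.0, 0.0])
scores = sample_risk_scores(updates, danger)  # one risk score per sample
```

In this sketch, the sample whose update aligns with the danger direction receives the highest score, matching the paper's premise that samples contributing more to danger-aligned drift pose greater fine-tuning risks.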
Entities
Institutions
- arXiv