ARTFEED — Contemporary Art Intelligence

SQSD Method Quantifies Sample-Level Safety Degradation in LLM Fine-Tuning

ai-technology · 2026-05-07

A new arXiv preprint (2605.04572) reports that fine-tuning large language models (LLMs), even on benign samples, can cause cumulative parameter drift toward danger-aligned directions, progressively eroding safety behaviors learned from millions of preference examples. The authors propose Sample-Level Quantification of Safety Degradation (SQSD), a method that assigns each training sample a continuous risk score by analyzing parameter dynamics during fine-tuning rather than only comparing parameters before and after training. By identifying which samples contribute most to safety degradation, SQSD enables targeted mitigation and offers a granular view of fine-tuning risk at the sample level.
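This summary does not reproduce the paper's formulation, but the core idea of scoring a sample by how strongly its gradient update pushes parameters along a safety-degrading direction can be sketched in PyTorch. Everything below is an illustrative assumption rather than the authors' implementation; in particular, danger_direction is a hypothetical vector in parameter space (for instance, the normalized parameter difference between an unaligned and an aligned checkpoint).

    import torch

    def flatten_grads(model):
        # Concatenate all parameter gradients into a single vector.
        return torch.cat([p.grad.detach().flatten()
                          for p in model.parameters() if p.grad is not None])

    def sample_risk_scores(model, loss_fn, samples, danger_direction):
        # danger_direction: assumed vector in parameter space pointing
        # toward degraded safety behavior; normalized to unit length here.
        d = danger_direction / danger_direction.norm()
        scores = []
        for x, y in samples:
            model.zero_grad()
            loss_fn(model(x), y).backward()
            g = flatten_grads(model)
            # A gradient-descent step moves parameters by -lr * g, so this
            # sample's drift along d is proportional to -<g, d>; that
            # sign-flipped inner product serves as its risk score.
            scores.append(float(-torch.dot(g, d)))
        return scores  # higher = stronger drift toward the danger direction

Accumulating such per-step projections over the course of training would give something like the cumulative drift the paper attributes to individual samples.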

Key facts

  • Safety alignment of LLMs is fragile; fine-tuning on benign samples can erase safety behaviors.
  • Existing studies compare parameters only before and after fine-tuning, ignoring how they evolve during training.
  • Benign fine-tuning causes cumulative parameter drift toward danger-aligned directions.
  • Samples that contribute more to this drift pose greater fine-tuning risks.
  • SQSD quantifies the influence of each training sample on safety degradation.
  • The method computes a continuous risk score for each individual sample.
  • Study published on arXiv with ID 2605.04572.
  • Research provides a granular tool for assessing fine-tuning risks and targeting mitigation (see the sketch after this list).
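The paper frames SQSD as enabling targeted mitigation, but its specific mitigation procedure is not detailed in this summary. As one hypothetical use of the scores, the riskiest samples could be filtered out before fine-tuning:

    def filter_high_risk(samples, scores, keep_fraction=0.9):
        # Keep the fraction of samples with the lowest risk scores; both the
        # filtering strategy and keep_fraction are illustrative choices,
        # not taken from the paper.
        ranked = sorted(zip(scores, samples), key=lambda pair: pair[0])
        cutoff = int(len(ranked) * keep_fraction)
        return [sample for _, sample in ranked[:cutoff]]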

Entities

Institutions

  • arXiv
