SPARD: New Defense Against Harmful Fine-Tuning of LLMs

ai-technology · 2026-05-28

A new defense named SPARD (Safety-Projected Alternating optimization with Relevance-Diversity aware data selection) has been introduced to protect large language models against harmful fine-tuning attacks. The framework is built on SPAG (Safety-Projected Alternating Gradient) optimization, which alternates utility updates with explicit safety projections computed on a selected set of safe data, enforcing safety constraints throughout fine-tuning. To assemble that safe data, SPARD uses a Relevance-Diversity Determinantal Point Process (RD-DPP), which efficiently selects a compact safe set that balances task relevance against safety coverage. In experiments on the GSM8K and OpenBookQA benchmarks, SPARD consistently recorded the lowest average attack success rate across four harmful fine-tuning attacks, significantly outperforming existing defenses while maintaining high task accuracy. The code is available, and the paper is on arXiv under ID 2605.28030.
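The summary above does not give SPAG's update rule, so the following is only a minimal sketch of the general pattern it describes: a utility gradient step followed by a projection back into a safe region. Everything here is an assumption made for illustration, including the toy quadratic utility loss, the ball-shaped safe set around a hypothetical safe reference point w_safe, and the function names. It is not the paper's algorithm.

```python
import numpy as np

def utility_grad(w, w_task):
    # Gradient of the toy utility loss 0.5 * ||w - w_task||^2 (assumed loss).
    return w - w_task

def project_to_safe_set(w, w_safe, radius):
    # Euclidean projection onto the ball {w : ||w - w_safe|| <= radius}.
    # The ball is a stand-in for "parameters that remain safe on the safe data".
    diff = w - w_safe
    norm = np.linalg.norm(diff)
    if norm <= radius:
        return w
    return w_safe + diff * (radius / norm)

def alternating_finetune(w0, w_task, w_safe, radius, lr=0.1, steps=200):
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        w = w - lr * utility_grad(w, w_task)        # utility update
        w = project_to_safe_set(w, w_safe, radius)  # safety projection
    return w

# Toy run: the task optimum lies outside the safe ball of radius 1.
w_safe = np.zeros(2)
w_task = np.array([3.0, 4.0])
w_final = alternating_finetune(w_safe, w_task, w_safe, radius=1.0)
```

In this toy setting the iterate settles on the boundary of the safe ball at the point closest to the task optimum, roughly (0.6, 0.8): as much utility as the constraint permits, and no more.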

Key facts

SPARD defends against harmful fine-tuning attacks on LLMs
Uses SPAG optimization alternating utility updates and safety projections
Employs Relevance-Diversity Determinantal Point Process for safe data selection
Tested on GSM8K and OpenBookQA under four attack types
Achieves the lowest average attack success rate compared with state-of-the-art defenses
Maintains high task accuracy
Code is available
arXiv ID: 2605.28030
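
The relevance-diversity selection named in the key facts can be illustrated with a standard quality-diversity DPP kernel and greedy MAP selection. This is a hedged sketch, not the RD-DPP from the paper: the cosine-similarity kernel, the relevance scores, the toy embeddings, and the function names are all assumptions, and the naive greedy loop favors clarity over speed.

```python
import numpy as np

def build_kernel(embeddings, relevance):
    # Quality-diversity decomposition: L[i, j] = q_i * cos_sim(i, j) * q_j,
    # where q is a per-item relevance score (assumed to be given).
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    return relevance[:, None] * similarity * relevance[None, :]

def greedy_dpp_select(L, k):
    # Greedy MAP inference: repeatedly add the item that yields the largest
    # log det of the kernel submatrix over the selected set.
    selected = []
    for _ in range(k):
        best_item, best_score = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_score:
                best_item, best_score = i, logdet
        if best_item is None:  # every remaining item is redundant
            break
        selected.append(best_item)
    return selected

# Toy pool: items 0 and 1 are near-duplicates, item 2 covers a new direction.
embeddings = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
relevance = np.array([1.0, 0.95, 0.8])
picked = greedy_dpp_select(build_kernel(embeddings, relevance), k=2)
```

Here the selection is [0, 2]. Item 1 is more relevant than item 2, but it nearly duplicates item 0, so the determinant rewards the more diverse pair.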
