ARTFEED — Contemporary Art Intelligence

SPARD: New Defense Against Harmful Fine-Tuning of LLMs

ai-technology · 2026-05-28

A defense strategy called SPARD (Safety-Projected Alternating optimization with Relevance-Diversity aware data selection) has been introduced to protect large language models against harmful fine-tuning attacks. The framework is built on SPAG (Safety-Projected Alternating Gradient) optimization, which alternates between utility updates and explicit safety projections that use a selected set of safe data, enforcing safety constraints during fine-tuning. To assemble that safe data, SPARD uses a Relevance-Diversity Determinantal Point Process (RD-DPP), which efficiently selects a compact safe set that balances task relevance with safety coverage. On the GSM8K and OpenBookQA benchmarks, SPARD consistently recorded the lowest average attack success rate across four harmful fine-tuning attacks, significantly outperforming existing defenses while maintaining high task accuracy. The code is available, and the paper can be found on arXiv under ID 2605.28030.
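To make the alternating idea concrete, here is a minimal toy sketch, not the SPARD implementation. It assumes a quadratic utility loss and stands in for the safety constraint with a single half-space g·w <= b, so the "safety projection" becomes an ordinary Euclidean projection (that is, plain projected gradient descent). The function names, the half-space constraint, and all numbers are hypothetical; the article does not describe how SPAG actually computes its projection.

```python
def utility_grad(w, target):
    # Gradient of the toy utility loss 0.5 * ||w - target||^2 (an assumption).
    return [wi - ti for wi, ti in zip(w, target)]


def project_halfspace(w, g, b):
    # Euclidean projection onto the toy "safe" set {w : g.w <= b}.
    violation = sum(gi * wi for gi, wi in zip(g, w)) - b
    if violation <= 0:
        return list(w)  # already inside the safe set
    scale = violation / sum(gi * gi for gi in g)
    return [wi - scale * gi for wi, gi in zip(w, g)]


def alternating_fit(w0, target, g, b, lr=0.1, steps=200):
    # Alternate a utility gradient step with a safety projection.
    w = list(w0)
    for _ in range(steps):
        grad = utility_grad(w, target)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]  # utility update
        w = project_halfspace(w, g, b)                 # safety projection
    return w


# Utility pulls w toward (2, 2); the toy safety constraint caps w[0] at 1.
w = alternating_fit([0.0, 0.0], target=[2.0, 2.0], g=[1.0, 0.0], b=1.0)
print(w)
```

In this toy setting the iterate settles near (1, 2): the unconstrained coordinate reaches its utility optimum, while the constrained one is held at the safety boundary after every step.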

Key facts

  • SPARD defends against harmful fine-tuning attacks on LLMs
  • Uses SPAG optimization alternating utility updates and safety projections
  • Employs Relevance-Diversity Determinantal Point Process for safe data selection
  • Tested on GSM8K and OpenBookQA under four attack types
  • Achieves the lowest average attack success rate compared with existing defenses
  • Maintains high task accuracy
  • Code is available
  • arXiv ID: 2605.28030
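
The relevance-diversity selection mentioned above can be illustrated with a generic greedy determinantal point process, sketched below under stated assumptions. It uses the standard quality-diversity kernel L[i][j] = q_i * sim(i, j) * q_j, where q_i is a relevance score and sim is cosine similarity, and greedily adds the item that most increases the determinant of the selected submatrix. Whether RD-DPP uses this exact kernel or selection rule is an assumption; the embeddings, relevance scores, and function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


def det(m):
    # Determinant via Gaussian elimination with partial pivoting.
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return d


def build_kernel(embeddings, relevance):
    # Quality-diversity kernel: relevance scales cosine similarity.
    unit = [normalize(e) for e in embeddings]
    n = len(unit)
    return [[relevance[i] * dot(unit[i], unit[j]) * relevance[j]
             for j in range(n)] for i in range(n)]


def greedy_dpp(kernel, k):
    # Greedily add the item that maximizes det of the selected submatrix.
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[kernel[a][b] for b in idx] for a in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:
            break  # no remaining item adds volume
        selected.append(best)
    return selected


# Items 0 and 1 are near-duplicates; item 2 is less relevant but distinct.
embeddings = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
relevance = [1.0, 0.95, 0.6]
picked = greedy_dpp(build_kernel(embeddings, relevance), k=2)
print(picked)  # [0, 2]
```

The example picks the most relevant item first, then skips its near-duplicate in favor of the distinct item, which is the relevance-versus-coverage trade-off the kernel is meant to encode.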

Entities

Institutions

  • arXiv
