Buffer-and-Reinforce Framework Defends LLMs Against Harmful Fine-Tuning
A new arXiv paper (2605.24550) proposes a Buffer-and-Reinforce fine-tuning framework to protect large language models from safety degradation in Fine-tuning-as-a-Service (FaaS) settings. The authors revisit temporary jailbreaking as a defense and provide a gradient-level analysis showing that it saturates safety-degrading gradients while preserving benign, task-relevant gradients. The framework has two components: BufferLoRA, a removable adapter that induces temporary jailbreaking to reduce harmful updates during user fine-tuning, and ReinforceLoRA, which is trained to recover refusal behavior after adaptation. According to the authors, this mechanism keeps models from learning undesired behaviors under harmful fine-tuning attacks, addressing a key vulnerability in LLM personalization.
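The adapter lifecycle described above can be pictured with a small NumPy sketch. This is an illustration only, not the paper's implementation: the matrix sizes, the random "adapters", and the assumption that ReinforceLoRA is applied as an additive low-rank update at deployment are all hypothetical stand-ins chosen to show why a LoRA-style buffer can be removed cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (illustrative values, not from the paper)

def lora_delta(rank, dim, rng):
    """Return a low-rank update B @ A, the standard LoRA parameterisation."""
    B = rng.normal(size=(dim, rank))
    A = rng.normal(size=(rank, dim))
    return B @ A

W_base = rng.normal(size=(d, d))          # frozen, safety-aligned base weights
buffer_delta = lora_delta(r, d, rng)      # hypothetical stand-in for BufferLoRA
user_delta = lora_delta(r, d, rng)        # hypothetical stand-in for the user's adapter
reinforce_delta = lora_delta(r, d, rng)   # hypothetical stand-in for ReinforceLoRA

# Phase 1 (assumed): user fine-tuning runs with the buffer attached.
W_training = W_base + buffer_delta + user_delta

# Phase 2 (assumed): the buffer is detached and the reinforcing adapter attached.
W_deployed = W_training - buffer_delta + reinforce_delta

# Because each adapter is a separate additive term, removing the buffer
# leaves no trace of it in the deployed weights.
print(np.allclose(W_deployed, W_base + user_delta + reinforce_delta))
```

The point of the sketch is the additive structure: a buffer kept as its own low-rank term can be subtracted exactly, whereas a change merged into the base weights during training could not be undone this way.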
Key facts
- Paper published on arXiv with ID 2605.24550
- Proposes Buffer-and-Reinforce fine-tuning framework
- Uses temporary jailbreaking as a defense
- Gradient-level analysis shows saturation of safety-degrading gradients
- BufferLoRA acts as removable adapter for temporary jailbreaking
- ReinforceLoRA recovers refusal behavior after adaptation
- Addresses harmful fine-tuning attacks in Fine-tuning-as-a-Service (FaaS)
- Preserves benign task-relevant gradients
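The saturation idea in the facts above can be illustrated with a toy calculation. This is not the paper's analysis: it assumes a single "comply" logit and a binary cross-entropy loss, for which the gradient with respect to the logit is sigmoid(z) - target. The logit values are hypothetical. If the model already complies with a harmful request (the temporarily jailbroken state), a harmful training example that asks it to comply yields almost no gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_wrt_logit(z, target=1.0):
    """Gradient of binary cross-entropy w.r.t. the logit z: sigmoid(z) - target."""
    return sigmoid(z) - target

# Hypothetical logits for "comply with the harmful request".
aligned_logit = -4.0   # aligned model: compliance is unlikely
buffered_logit = 4.0   # temporarily jailbroken model: compliance is already likely

# A harmful fine-tuning example pushes toward compliance (target = 1).
g_aligned = abs(grad_wrt_logit(aligned_logit))
g_buffered = abs(grad_wrt_logit(buffered_logit))

print(f"gradient magnitude, aligned model:  {g_aligned:.3f}")
print(f"gradient magnitude, buffered model: {g_buffered:.3f}")
```

In this toy setting the harmful example produces a large update on the aligned model and a near-zero update on the buffered one. A benign task whose targets the model does not yet predict well would still produce sizeable gradients in either state.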
Entities
Institutions
- arXiv