REFUSALGUARD: Preserving LLM Safety During Fine-Tuning
A new arXiv paper (2605.01913) introduces REFUSALGUARD, a framework for maintaining safety in large language models during fine-tuning. Standard fine-tuning degrades refusal behavior by distorting the safety-relevant representations a model encodes in activation space, increasing compliance with harmful requests; one way such drift can be quantified is sketched below. REFUSALGUARD constrains fine-tuning at the representation level, preserving the geometric structure of these representations and thereby preventing alignment degradation.
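The summary does not give the paper's actual diagnostic, so the following is only a minimal sketch of how representation drift might be measured, under one common assumption: that refusal behavior is mediated by a difference-of-means direction between activations on harmful and harmless prompts. The function names and the toy random activations are illustrative; real use would extract hidden states from the base and fine-tuned models at a fixed layer and token position.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction separating harmful from harmless prompts.

    harmful_acts / harmless_acts: (n_prompts, d_model) hidden states taken at
    a fixed layer and token position (e.g. the final prompt token).
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def representation_drift(base_dir: torch.Tensor,
                         tuned_dir: torch.Tensor) -> float:
    """Cosine between the base and fine-tuned refusal directions.

    1.0 means the direction is unchanged; values toward 0 indicate the
    fine-tuned model has rotated or collapsed the safety-relevant axis.
    """
    return torch.dot(base_dir, tuned_dir).item()

# Toy usage: random tensors stand in for real model activations.
torch.manual_seed(0)
d = 4096
base_dir = refusal_direction(torch.randn(64, d) + 1.0, torch.randn(64, d))
tuned_dir = refusal_direction(torch.randn(64, d) + 0.3, torch.randn(64, d))
print(f"cosine(base, tuned) = {representation_drift(base_dir, tuned_dir):.3f}")
```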
Key facts
- arXiv paper 2605.01913 introduces REFUSALGUARD
- Standard fine-tuning degrades safety-aligned LLM refusal behavior
- Safety-relevant features are encoded in structured representations in activation space
- Fine-tuning induces systematic drift and distortion in safety representations
- Interference between task optimization and safety features increases compliance with harmful requests
- REFUSALGUARD is a representation-level fine-tuning framework
- REFUSALGUARD preserves safety-relevant structure during fine-tuning (one plausible form of such a constraint is sketched after this list)
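Since the summary does not spell out REFUSALGUARD's objective, the sketch below shows only what a representation-level safety constraint could look like: an auxiliary penalty that anchors the fine-tuned model's activations on safety probe prompts to those of the frozen base model. The function name `guarded_loss`, the tensor shapes, and the cosine form of the penalty are all illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def guarded_loss(task_loss: torch.Tensor,
                 tuned_safety_acts: torch.Tensor,
                 base_safety_acts: torch.Tensor,
                 lam: float = 1.0) -> torch.Tensor:
    """Task loss plus a penalty anchoring safety-relevant activations.

    Hypothetical sketch (not the paper's stated objective):
      tuned_safety_acts: (batch, d_model) hidden states of the model being
        fine-tuned, taken on held-out safety probe prompts
      base_safety_acts:  the frozen base model's hidden states on the same
        prompts, treated as a fixed geometric reference
    """
    # 1 - cosine similarity penalizes rotation of each safety representation
    # away from the base model's, i.e. the drift described in the key facts.
    drift = 1.0 - F.cosine_similarity(
        tuned_safety_acts, base_safety_acts.detach(), dim=-1
    )
    return task_loss + lam * drift.mean()

# Toy usage: random tensors stand in for real hidden states.
torch.manual_seed(0)
task_loss = torch.tensor(2.3)
tuned = torch.randn(8, 4096, requires_grad=True)
base = torch.randn(8, 4096)
loss = guarded_loss(task_loss, tuned, base, lam=0.5)
loss.backward()  # gradients flow only through the tuned activations
print(f"combined loss = {loss.item():.3f}")
```

A pointwise anchor to the base model is the simplest choice that makes drift directly differentiable; a penalty on pairwise distances among safety representations would be another way to preserve geometric structure rather than exact positions.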