REFUSALGUARD: Preserving LLM Safety During Fine-Tuning
A new arXiv paper (2605.01913) introduces REFUSALGUARD, a framework for maintaining safety in large language models during fine-tuning. Standard fine-tuning degrades refusal behavior by distorting the safety-relevant representations a model encodes in activation space, increasing compliance with harmful requests; one way such drift can be quantified is sketched below. REFUSALGUARD constrains fine-tuning at the representation level, preserving the geometric structure of these representations and thereby preventing alignment degradation.
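The summary does not give the paper's actual diagnostic, so the following is only a minimal sketch of how representation drift might be measured, under one common assumption: that refusal behavior is mediated by a difference-of-means direction between activations on harmful and harmless prompts. The function names and the toy random activations are illustrative; real use would extract hidden states from the base and fine-tuned models at a fixed layer and token position.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction separating harmful from harmless prompts.

    harmful_acts / harmless_acts: (n_prompts, d_model) hidden states taken at
    a fixed layer and token position (e.g. the final prompt token).
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def representation_drift(base_dir: torch.Tensor,
                         tuned_dir: torch.Tensor) -> float:
    """Cosine between the base and fine-tuned refusal directions.

    1.0 means the direction is unchanged; values toward 0 indicate the
    fine-tuned model has rotated or collapsed the safety-relevant axis.
    """
    return torch.dot(base_dir, tuned_dir).item()

# Toy usage: random tensors stand in for real model activations.
torch.manual_seed(0)
d = 4096
base_dir = refusal_direction(torch.randn(64, d) + 1.0, torch.randn(64, d))
tuned_dir = refusal_direction(torch.randn(64, d) + 0.3, torch.randn(64, d))
print(f"cosine(base, tuned) = {representation_drift(base_dir, tuned_dir):.3f}")
```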
Key facts
- arXiv paper 2605.01913 introduces REFUSALGUARD
- Standard fine-tuning degrades safety-aligned LLM refusal behavior
- Safety-relevant features are encoded in structured representations in activation space
- Fine-tuning induces systematic drift and distortion in safety representations
- Interference between task optimization and safety features increases compliance with harmful requests
- REFUSALGUARD is a representation-level fine-tuning framework
- REFUSALGUARD preserves safety-relevant structure during fine-tuning (one plausible form of such a constraint is sketched after this list)
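Since the summary does not spell out REFUSALGUARD's objective, the sketch below shows only what a representation-level safety constraint could look like: an auxiliary penalty that anchors the fine-tuned model's activations on safety probe prompts to those of the frozen base model. The function name `guarded_loss`, the tensor shapes, and the cosine form of the penalty are all illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def guarded_loss(task_loss: torch.Tensor,
                 tuned_safety_acts: torch.Tensor,
                 base_safety_acts: torch.Tensor,
                 lam: float = 1.0) -> torch.Tensor:
    """Task loss plus a penalty anchoring safety-relevant activations.

    Hypothetical sketch (not the paper's stated objective):
      tuned_safety_acts: (batch, d_model) hidden states of the model being
        fine-tuned, taken on held-out safety probe prompts
      base_safety_acts:  the frozen base model's hidden states on the same
        prompts, treated as a fixed geometric reference
    """
    # 1 - cosine similarity penalizes rotation of each safety representation
    # away from the base model's, i.e. the drift described in the key facts.
    drift = 1.0 - F.cosine_similarity(
        tuned_safety_acts, base_safety_acts.detach(), dim=-1
    )
    return task_loss + lam * drift.mean()

# Toy usage: random tensors stand in for real hidden states.
torch.manual_seed(0)
task_loss = torch.tensor(2.3)
tuned = torch.randn(8, 4096, requires_grad=True)
base = torch.randn(8, 4096)
loss = guarded_loss(task_loss, tuned, base, lam=0.5)
loss.backward()  # gradients flow only through the tuned activations
print(f"combined loss = {loss.item():.3f}")
```

A pointwise anchor to the base model is the simplest choice that makes drift directly differentiable; a penalty on pairwise distances among safety representations would be another way to preserve geometric structure rather than exact positions.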