ARTFEED — Contemporary Art Intelligence

Benign Fine-Tuning Destroys Safety Alignment in Guard Models

ai-technology · 2026-05-07

A recent investigation finds that fine-tuning guard models solely on harmless data can severely compromise their safety alignment. The researchers demonstrated the effect across three safety classifiers deployed as protective layers in agentic AI systems: LlamaGuard, WildGuard, and Granite Guardian. The failure stems from disruption of latent safety geometry, the harmful-versus-benign representational boundary these classifiers depend on. By applying SVD to class-conditional activation differences, the authors identified per-layer safety subspaces and tracked how the boundary shifted during benign fine-tuning. Granite Guardian collapsed entirely: its refusal rate fell from 85% to 0%, its CKA (centered kernel alignment) similarity with the original model dropped to zero, and 100% of its outputs became ambiguous. The severity exceeds previous findings on general-purpose LLMs, which the authors attribute to a specialization hypothesis: concentrated safety representations are efficient but fragile.
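
The article does not reproduce the researchers' code, so the following is a minimal sketch of the general technique it describes: using SVD on class-conditional activation differences to extract a per-layer safety subspace, plus a simple overlap score for checking how much of that subspace survives fine-tuning. The exact difference construction, the rank, and the overlap metric are assumptions for illustration, not the paper's implementation.

    # Sketch: identify a per-layer "safety subspace" from guard-model activations.
    # Assumes you have already collected hidden states for harmful and benign
    # prompts at one layer, each shaped (num_prompts, hidden_dim).

    import numpy as np

    def safety_subspace(harmful_acts: np.ndarray,
                        benign_acts: np.ndarray,
                        rank: int = 8):
        """Return an orthonormal basis (hidden_dim x rank) spanning directions
        that separate harmful from benign activations at this layer."""
        mu_h = harmful_acts.mean(axis=0)
        mu_b = benign_acts.mean(axis=0)
        # Class-conditional differences: each example contrasted with the
        # opposite class mean (one reasonable construction, assumed here).
        diffs = np.concatenate([
            harmful_acts - mu_b,
            -(benign_acts - mu_h),
        ], axis=0)
        # Top right-singular vectors of the difference matrix span the
        # harmful-vs-benign boundary directions.
        _, s, vt = np.linalg.svd(diffs, full_matrices=False)
        return vt[:rank].T, s[:rank]

    def subspace_overlap(basis_before: np.ndarray, basis_after: np.ndarray) -> float:
        """Mean squared cosine of the principal angles between two subspaces
        (1 = geometry preserved, 0 = safety subspace destroyed)."""
        cosines = np.linalg.svd(basis_before.T @ basis_after, compute_uv=False)
        return float(np.mean(cosines ** 2))

Running safety_subspace on activations before and after benign fine-tuning, then scoring the two bases with subspace_overlap, is one way to monitor the kind of boundary drift the study reports.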

Key facts

  • Fine-tuning on benign data can destroy safety alignment in guard models.
  • Three guard models tested: LlamaGuard, WildGuard, Granite Guardian.
  • Granite Guardian refusal rate dropped from 85% to 0%.
  • CKA similarity fell to zero for Granite Guardian (see the CKA sketch after this list).
  • 100% of Granite Guardian outputs became ambiguous.
  • Failure originates from destruction of latent safety geometry.
  • Researchers used SVD on class-conditional activation differences.
  • Specialization hypothesis: concentrated safety representations are efficient but fragile.
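
The CKA figure cited above is a standard representational-similarity metric (Kornblith et al.'s centered kernel alignment). Below is a minimal sketch of linear CKA between a layer's activations on the same prompts before and after fine-tuning; the array names are assumptions. A value near 1 means the representation geometry is preserved, while a value near 0 matches the collapse reported for Granite Guardian.

    # Sketch: linear CKA between pre- and post-fine-tuning activations.

    import numpy as np

    def linear_cka(acts_before: np.ndarray, acts_after: np.ndarray) -> float:
        """acts_before, acts_after: (num_prompts, hidden_dim) activations for the
        same prompts at the same layer, before and after fine-tuning."""
        x = acts_before - acts_before.mean(axis=0, keepdims=True)
        y = acts_after - acts_after.mean(axis=0, keepdims=True)
        cross = np.linalg.norm(y.T @ x, "fro") ** 2   # cross-covariance energy
        norm_x = np.linalg.norm(x.T @ x, "fro")       # self-similarity of X
        norm_y = np.linalg.norm(y.T @ y, "fro")       # self-similarity of Y
        return float(cross / (norm_x * norm_y))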
