Disentangled Safety Adapters Enable Efficient AI Guardrails
Researchers propose Disentangled Safety Adapters (DSA), a framework that decouples safety computations from a task-optimized base model using lightweight adapters that read the base model's internal representations. DSA-based guardrails outperform comparably sized standalone models by up to 53% in AUC on hate speech classification, unsafe input/output detection, and hallucination detection. The approach also allows dynamic, inference-time adjustment of alignment strength and fine-grained trade-offs with instruction following, while adding minimal inference cost.
Key facts
- DSA decouples safety-specific computations from a task-optimized base model.
- DSA uses lightweight adapters that leverage the base model's internal representations.
- DSA-based guardrails outperform comparably sized standalone models by up to 53% in AUC.
- Tasks include hate speech classification, detecting unsafe inputs/responses, and hallucination detection.
- DSA enables dynamic, inference-time adjustment of alignment strength.
- DSA allows fine-grained trade-off between instruction following and safety.
- The framework addresses efficiency and flexibility challenges in existing AI safety paradigms.
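The guardrail mechanism described above can be sketched roughly as follows. The paper's actual adapter architecture, layer choices, and training objective are not specified in this summary, so every name, shape, and threshold below is an illustrative assumption: a tiny probe reads a (frozen) base model's hidden state and produces a safety score, with an inference-time `alpha` knob standing in for adjustable alignment strength.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: base-model hidden size and a tiny adapter.
HIDDEN = 64
ADAPTER = 8

# Stand-in for the frozen, task-optimized base model: a fixed random
# projection producing a hidden state. In DSA the adapter reuses the
# base model's internal representations rather than recomputing them.
W_base = rng.normal(size=(HIDDEN, HIDDEN))

def base_hidden_state(x: np.ndarray) -> np.ndarray:
    """Hidden state from the (frozen) base model for input features x."""
    return np.tanh(W_base @ x)

# Lightweight safety adapter: a two-layer ReLU probe over the hidden state.
W1 = rng.normal(size=(ADAPTER, HIDDEN)) * 0.1
W2 = rng.normal(size=(1, ADAPTER)) * 0.1

def safety_score(h: np.ndarray) -> float:
    """Sigmoid probability that the content is unsafe (illustrative)."""
    z = W2 @ np.maximum(W1 @ h, 0.0)
    return float(1.0 / (1.0 + np.exp(-z[0])))

def guarded_output(x: np.ndarray, alpha: float, threshold: float = 0.5):
    """alpha scales alignment strength at inference time: alpha=0 turns
    the guardrail off; larger alpha blocks at lower raw safety scores."""
    h = base_hidden_state(x)
    score = safety_score(h)
    verdict = "blocked" if alpha * score > threshold else "allowed"
    return verdict, score

x = rng.normal(size=HIDDEN)
print(guarded_output(x, alpha=0.0))  # guardrail disabled
print(guarded_output(x, alpha=2.0))  # stricter alignment
```

Because the adapter only adds a small probe on top of representations the base model already computes, the safety check costs far less than running a separate standalone guardrail model, which is the efficiency argument the paper makes.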
- The paper is available on arXiv as arXiv:2506.00166.