RAG-Pref: Training-Free LLM Alignment via Retrieval Augmented Generation
A new method called Retrieval Augmented Generation for Preference Alignment (RAG-Pref) improves LLM refusal guardrails against agentic attacks without the computational overhead of traditional alignment algorithms. RAG-Pref is a training-free, online algorithm that conditions on preferred and dispreferred samples during inference to leverage contrastive information. When combined with offline training-based alignment, it achieves over a 3.7x improvement in agentic attack refusal. The approach is compatible with off-the-shelf packages and addresses the gap where state-of-the-art alignment algorithms require significant resources yet remain vulnerable to recent attacks.
Key facts
- RAG-Pref is a training-free alignment algorithm
- It uses retrieval augmented generation for preference alignment
- Conditions on preferred and dispreferred samples during inference
- Combined with offline alignment yields over 3.7x improvement in agentic attack refusal
- Addresses computational resource demands of traditional alignment
- Compatible with off-the-shelf packages
- Targets refusal guardrails against agentic attacks
- Introduced in arXiv:2605.11217
Entities
Institutions
- arXiv