SafeHarbor: Memory-Augmented Guardrail for LLM Agent Safety
Researchers propose SafeHarbor, a guardrail framework that aims to make LLM agents safer without causing over-refusal. It extracts context-aware defense rules through enhanced adversarial generation and stores them in a local hierarchical memory, from which rules are injected dynamically. The approach is training-free and plug-and-play.
Key facts
- arXiv:2605.05704
- SafeHarbor is a hierarchical memory-augmented guardrail
- Addresses over-refusal problem in LLM agent safety
- Extracts context-aware defense rules via enhanced adversarial generation
- Uses local hierarchical memory for dynamic rule injection
- Training-free, efficient, plug-and-play solution
- Introduces an information-entropy-based mechanism
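
As a rough illustration of what "hierarchical memory for dynamic rule injection" could look like, here is a minimal sketch. It is not the authors' implementation: the `RuleMemory` class, the `inject_rules` helper, the two-level category/rule layout, and the keyword-overlap matching are all assumptions made for this example. The idea shown is that rules are added to the prompt only when they match the current task, so unrelated tasks are not burdened with blanket restrictions.

```python
from dataclasses import dataclass, field


@dataclass
class RuleMemory:
    """Hypothetical two-level store: category -> list of (keywords, rule text)."""
    levels: dict[str, list[tuple[set[str], str]]] = field(default_factory=dict)

    def add(self, category: str, keywords: set[str], rule: str) -> None:
        self.levels.setdefault(category, []).append((keywords, rule))

    def retrieve(self, category: str, task: str, top_k: int = 2) -> list[str]:
        """Return up to top_k rules in `category` whose keywords overlap the task."""
        words = set(task.lower().split())
        scored = [
            (len(keywords & words), rule)
            for keywords, rule in self.levels.get(category, [])
        ]
        # Keep only rules with at least one matching keyword, best match first.
        matched = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [rule for _, rule in matched[:top_k]]


def inject_rules(base_prompt: str, rules: list[str]) -> str:
    """Append retrieved rules to the prompt; leave it unchanged if none match."""
    if not rules:
        return base_prompt
    bullet_list = "\n".join(f"- {r}" for r in rules)
    return f"{base_prompt}\n\nSafety rules for this task:\n{bullet_list}"


# Example usage with made-up rules.
memory = RuleMemory()
memory.add("filesystem", {"delete", "rm"}, "Confirm with the user before deleting files.")
memory.add("email", {"send", "forward"}, "Do not forward messages containing credentials.")

rules = memory.retrieve("filesystem", "delete the old logs")
print(inject_rules("You are a helpful agent.", rules))
```

Keyword overlap is used here only to keep the sketch self-contained. A real system would more likely use embedding similarity or another retrieval method, and the summary does not say which one SafeHarbor uses.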
Entities
Institutions
- arXiv