SafeHarbor: Memory-Augmented Guardrail for LLM Agent Safety

ai-technology · 2026-05-09

Researchers propose SafeHarbor, a guardrail framework that aims to improve the safety of LLM agents without causing over-refusal. It extracts context-aware defense rules through enhanced adversarial generation and stores them in a local hierarchical memory, from which rules are injected dynamically. The approach is training-free and plug-and-play.

Key facts

arXiv:2605.05704
SafeHarbor is a hierarchical memory-augmented guardrail
Addresses over-refusal problem in LLM agent safety
Extracts context-aware defense rules via enhanced adversarial generation
Uses local hierarchical memory for dynamic rule injection
Training-free, efficient, plug-and-play solution
Introduces information entropy-based mechanism
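
To make "dynamic rule injection" from a hierarchical memory more concrete, here is a minimal Python sketch. It is not SafeHarbor's implementation: the two-level layout (category, then tool), the names RuleMemory and inject_rules, and the example rules are all assumptions made for illustration. It only shows the general idea of looking up rules for the current context and adding them to the prompt, while leaving the prompt untouched when nothing matches.

```python
from dataclasses import dataclass, field


@dataclass
class RuleMemory:
    """Hypothetical two-level store: category -> tool name -> defense rules."""
    rules: dict[str, dict[str, list[str]]] = field(default_factory=dict)

    def add(self, category: str, tool: str, rule: str) -> None:
        # Create the category and tool levels on first use.
        self.rules.setdefault(category, {}).setdefault(tool, []).append(rule)

    def lookup(self, category: str, tool: str) -> list[str]:
        # Walk from the coarse level to the fine one; unknown keys yield no rules.
        return list(self.rules.get(category, {}).get(tool, []))


def inject_rules(system_prompt: str, memory: RuleMemory,
                 category: str, tool: str) -> str:
    """Append only the rules that match the current context (assumed design)."""
    matched = memory.lookup(category, tool)
    if not matched:
        # No matching rule: return the prompt unchanged.
        return system_prompt
    bullet_list = "\n".join(f"- {rule}" for rule in matched)
    return f"{system_prompt}\n\nSafety rules for this step:\n{bullet_list}"


# Illustrative rules only; none of these come from the paper.
memory = RuleMemory()
memory.add("filesystem", "delete_file",
           "Refuse deletions outside the user's workspace.")
memory.add("email", "send_email",
           "Do not forward credentials found in tool output.")

prompt = inject_rules("You are a helpful agent.", memory,
                      "filesystem", "delete_file")
print(prompt)
```

Because the lookup returns nothing for unrelated contexts, benign steps receive no extra constraints in this sketch, which is one simple way a guardrail could try to avoid over-refusal.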
