EPO-Safe: LLM Agents Learn Safety from 1-Bit Danger Signals
Researchers have introduced EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework that lets large language model agents learn safety objectives purely from experience. The agent generates action plans step by step, receives a minimal danger signal (a single bit per time step), and improves a natural-language behavioral specification through reflection. Unlike standard LLM reflection methods, which depend on rich textual feedback, EPO-Safe operates on this sparse signal alone: the agent never accesses the hidden performance function R*, only a one-bit flag marking unsafe actions. Evaluated on five AI Safety Gridworlds and five text-based analogs, EPO-Safe discovers safe behavior within one to two rounds, suggesting a promising route to safety reasoning in autonomous agents.
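The loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the action names, the `plan` and `reflect` functions (stand-ins for LLM calls), and the environment are all invented for the example; only the interface (a one-bit danger flag per time step, a spec evolved by reflection) follows the description.

```python
# Hypothetical sketch of the EPO-Safe loop; names are illustrative.

CANDIDATES = ["step_left", "step_right", "cross_lava", "wait"]
DANGEROUS = {"cross_lava"}  # hidden from the agent; it only sees 1-bit flags


def plan(spec, horizon=3):
    """Stand-in for the LLM planner: pick actions the spec allows."""
    forbidden = {rule.removeprefix("avoid ") for rule in spec}
    allowed = [a for a in CANDIDATES if a not in forbidden]
    return (allowed * horizon)[:horizon]


def danger_signal(action):
    """The environment returns exactly one bit per time step."""
    return 1 if action in DANGEROUS else 0


def reflect(spec, trajectory):
    """Stand-in for LLM reflection: turn flagged steps into spec rules."""
    for action, flag in trajectory:
        rule = f"avoid {action}"
        if flag and rule not in spec:
            spec.append(rule)
    return spec


spec = []
for _ in range(2):  # the paper reports 1-2 rounds suffice
    actions = plan(spec)
    trajectory = [(a, danger_signal(a)) for a in actions]
    spec = reflect(spec, trajectory)

print(spec)  # the evolved natural-language behavioral specification
```

In this toy run, the first plan trips the danger bit once, reflection adds a rule to the spec, and the second plan avoids the flagged action entirely; the agent never sees the hidden `DANGEROUS` set, mirroring its lack of access to R*.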
Key facts
- EPO-Safe framework uses 1-bit danger signals for safety learning
- LLM agents generate action plans and receive binary warnings
- No access to hidden performance function R*
- Evaluated on five AI Safety Gridworlds and five text-based analogs
- Safe behavior discovered within 1-2 rounds
- Contrasts with standard LLM reflection methods needing detailed feedback
- Framework evolves natural language behavioral specification through reflection
- Published on arXiv (2604.23210)