EPO-Safe: LLM Agents Learn Safety from 1-Bit Danger Signals
Researchers have introduced EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework that lets large language model agents learn safety objectives purely from experience. The agent generates action plans step by step, receives a minimal danger signal (a single bit per time step), and improves a natural-language behavioral specification through reflection. Unlike standard LLM reflection methods, which depend on rich textual feedback, EPO-Safe operates on this sparse signal alone: the agent never accesses the hidden performance function R*, only a one-bit flag marking unsafe actions. Evaluated on five AI Safety Gridworlds and five text-based analogs, EPO-Safe discovers safe behavior within one to two rounds, suggesting a promising route to safety reasoning in autonomous agents.
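The loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the action names, the `plan` and `reflect` functions (stand-ins for LLM calls), and the environment are all invented for the example; only the interface (a one-bit danger flag per time step, a spec evolved by reflection) follows the description.

```python
# Hypothetical sketch of the EPO-Safe loop; names are illustrative.

CANDIDATES = ["step_left", "step_right", "cross_lava", "wait"]
DANGEROUS = {"cross_lava"}  # hidden from the agent; it only sees 1-bit flags


def plan(spec, horizon=3):
    """Stand-in for the LLM planner: pick actions the spec allows."""
    forbidden = {rule.removeprefix("avoid ") for rule in spec}
    allowed = [a for a in CANDIDATES if a not in forbidden]
    return (allowed * horizon)[:horizon]


def danger_signal(action):
    """The environment returns exactly one bit per time step."""
    return 1 if action in DANGEROUS else 0


def reflect(spec, trajectory):
    """Stand-in for LLM reflection: turn flagged steps into spec rules."""
    for action, flag in trajectory:
        rule = f"avoid {action}"
        if flag and rule not in spec:
            spec.append(rule)
    return spec


spec = []
for _ in range(2):  # the paper reports 1-2 rounds suffice
    actions = plan(spec)
    trajectory = [(a, danger_signal(a)) for a in actions]
    spec = reflect(spec, trajectory)

print(spec)  # the evolved natural-language behavioral specification
```

In this toy run, the first plan trips the danger bit once, reflection adds a rule to the spec, and the second plan avoids the flagged action entirely; the agent never sees the hidden `DANGEROUS` set, mirroring its lack of access to R*.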
Key facts
- EPO-Safe framework uses 1-bit danger signals for safety learning
- LLM agents generate action plans and receive binary warnings
- No access to hidden performance function R*
- Evaluated on five AI Safety Gridworlds and five text-based analogs
- Safe behavior discovered within 1-2 rounds
- Contrasts with standard LLM reflection methods needing detailed feedback
- Framework evolves natural language behavioral specification through reflection
- Published on arXiv (2604.23210)