FATE: On-Policy Self-Evolution for Safer LLM Agents
Researchers have introduced FATE, an on-policy self-evolving framework that leverages failure trajectories to improve the safety alignment of tool-using LLM agents without requiring expert demonstrations. Existing safety signals tend to be either response-level or off-policy, forcing a trade-off between safety and utility. FATE converts verifier-scored failures into repair guidance: for each failure, the policy proposes candidate repairs, which verifiers then re-score on security, utility, over-refusal control, and trajectory validity. The resulting dense, trajectory-level data serves as a supervision signal that improves agent safety while preserving task performance.
Key facts
- FATE is an on-policy self-evolving framework for agentic safety alignment.
- It uses failure trajectories rather than only final responses.
- Existing safety signals are response-level or off-policy.
- FATE transforms verifier-scored failures into repair supervision without expert demonstrations.
- Repair candidates are re-scored across security, utility, over-refusal control, and trajectory validity.
- The framework aims to avoid safety-utility trade-offs.
- Tool-using LLM agents may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks.
- The approach uses dense trajectory-level information as a supervision signal.
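The failure-to-repair loop described above can be sketched in code. Everything below is a toy illustration, not FATE's actual implementation: the verifier, the unsafe-step marker, and the stand-in policy are all hypothetical; only the four scoring axes and the overall flow (score rollouts, have the policy propose repairs for failures, re-score candidates, keep passing repairs as trajectory-level supervision) come from the text.

```python
# Hypothetical verifier axes named in the text: security, utility,
# over-refusal control, and trajectory validity.
AXES = ("security", "utility", "over_refusal", "validity")

def verify(trajectory):
    """Toy verifier: fails the security axis if any step contains the
    marker string 'UNSAFE' (a stand-in for a real unsafe tool call)."""
    unsafe = any("UNSAFE" in step for step in trajectory)
    return {
        "security": 0.0 if unsafe else 1.0,
        "utility": 1.0,
        "over_refusal": 1.0,
        "validity": 1.0,
    }

def propose_repairs(policy, failure, n=4):
    """The policy samples n candidate repairs for a failed trajectory
    (on-policy: repairs come from the agent itself, not an expert)."""
    return [policy(failure) for _ in range(n)]

def fate_round(policy, rollouts, threshold=1.0):
    """One self-evolution round: score rollouts, turn failures into
    verifier-filtered (failure, repair) pairs for supervision."""
    supervision = []
    for traj in rollouts:
        if min(verify(traj).values()) >= threshold:
            continue  # not a failure; nothing to repair
        for candidate in propose_repairs(policy, traj):
            # Re-score each candidate repair on all four axes.
            if min(verify(candidate).values()) >= threshold:
                supervision.append((traj, candidate))
    return supervision

# Stand-in policy: rewrites unsafe steps instead of refusing the task,
# so utility is preserved rather than traded away for safety.
def toy_policy(traj):
    return [step.replace("UNSAFE", "SAFE") for step in traj]

rollouts = [
    ["call:search(q)", "UNSAFE call:delete_files"],  # failing rollout
    ["call:search(q)", "final answer"],              # passing rollout
]
pairs = fate_round(toy_policy, rollouts)
print(len(pairs))  # → 4: one failure, four accepted candidate repairs
```

In this sketch, only trajectories that fail some verifier axis generate training pairs, and a repair is kept only if it passes every axis; that filtering is what keeps the supervision from rewarding over-refusal or broken trajectories.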