FATE: On-Policy Self-Evolution for Safer LLM Agents
Researchers have introduced FATE, an on-policy self-evolving framework that leverages failure trajectories to improve the safety alignment of tool-using LLM agents without requiring expert demonstrations. Existing safety signals tend to be either response-level or off-policy, forcing a trade-off between safety and utility. FATE converts verifier-scored failures into repair guidance: for each failure, the policy proposes candidate repairs, which verifiers then re-score on security, utility, over-refusal control, and trajectory validity. The resulting dense, trajectory-level data serves as a supervision signal that improves agent safety while preserving task performance.
Key facts
- FATE is an on-policy self-evolving framework for agentic safety alignment.
- It uses failure trajectories rather than only final responses.
- Existing safety signals are response-level or off-policy.
- FATE transforms verifier-scored failures into repair supervision without expert demonstrations.
- Repair candidates are re-scored across security, utility, over-refusal control, and trajectory validity.
- The framework aims to avoid safety-utility trade-offs.
- Tool-using LLM agents may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks.
- The approach uses dense trajectory-level information as a supervision signal.
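The failure-to-repair loop described above can be sketched in code. Everything below is a toy illustration, not FATE's actual implementation: the verifier, the unsafe-step marker, and the stand-in policy are all hypothetical; only the four scoring axes and the overall flow (score rollouts, have the policy propose repairs for failures, re-score candidates, keep passing repairs as trajectory-level supervision) come from the text.

```python
# Hypothetical verifier axes named in the text: security, utility,
# over-refusal control, and trajectory validity.
AXES = ("security", "utility", "over_refusal", "validity")

def verify(trajectory):
    """Toy verifier: fails the security axis if any step contains the
    marker string 'UNSAFE' (a stand-in for a real unsafe tool call)."""
    unsafe = any("UNSAFE" in step for step in trajectory)
    return {
        "security": 0.0 if unsafe else 1.0,
        "utility": 1.0,
        "over_refusal": 1.0,
        "validity": 1.0,
    }

def propose_repairs(policy, failure, n=4):
    """The policy samples n candidate repairs for a failed trajectory
    (on-policy: repairs come from the agent itself, not an expert)."""
    return [policy(failure) for _ in range(n)]

def fate_round(policy, rollouts, threshold=1.0):
    """One self-evolution round: score rollouts, turn failures into
    verifier-filtered (failure, repair) pairs for supervision."""
    supervision = []
    for traj in rollouts:
        if min(verify(traj).values()) >= threshold:
            continue  # not a failure; nothing to repair
        for candidate in propose_repairs(policy, traj):
            # Re-score each candidate repair on all four axes.
            if min(verify(candidate).values()) >= threshold:
                supervision.append((traj, candidate))
    return supervision

# Stand-in policy: rewrites unsafe steps instead of refusing the task,
# so utility is preserved rather than traded away for safety.
def toy_policy(traj):
    return [step.replace("UNSAFE", "SAFE") for step in traj]

rollouts = [
    ["call:search(q)", "UNSAFE call:delete_files"],  # failing rollout
    ["call:search(q)", "final answer"],              # passing rollout
]
pairs = fate_round(toy_policy, rollouts)
print(len(pairs))  # → 4: one failure, four accepted candidate repairs
```

In this sketch, only trajectories that fail some verifier axis generate training pairs, and a repair is kept only if it passes every axis; that filtering is what keeps the supervision from rewarding over-refusal or broken trajectories.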