ARTFEED — Contemporary Art Intelligence

FATE: On-Policy Self-Evolution for Safer LLM Agents

ai-technology · 2026-05-13

A team of researchers has introduced FATE, an on-policy self-evolving framework that leverages failure trajectories to improve the safety alignment of tool-using LLM agents without requiring expert demonstrations. Existing safety signals are typically response-level or off-policy, forcing a trade-off between safety and utility. FATE instead converts verifier-scored failures into repair supervision: for each failure, the policy itself proposes candidate repairs, which verifiers then re-score for security, utility, over-refusal control, and trajectory validity. This dense, trajectory-level signal supervises training, improving agent safety while preserving task performance.
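The repair loop described above can be sketched in a few lines. This is an illustrative toy, not the authors' code: `toy_policy`, `toy_verifiers`, and the scoring values are all assumptions standing in for the real agent and verifiers.

```python
# Toy sketch of an on-policy repair loop: a verifier-flagged failure is
# "repaired" by the policy itself, and candidates are re-scored on the
# four axes the article names. All names and scores are illustrative.

def toy_policy(failure, i):
    # Stand-in for the agent: replace the unsafe final step with either
    # a safe tool call or an outright refusal.
    options = ["safe_call", "refuse"]
    return failure[:-1] + [options[i % len(options)]]

def toy_verifiers(traj):
    # Stand-in verifier scores; real verifiers would inspect tool calls,
    # injected instructions, and task completion.
    return {
        "security": 0.0 if "unsafe_call" in traj else 1.0,
        "utility": 1.0 if traj[-1] == "safe_call" else 0.3,
        "over_refusal": 0.0 if traj[-1] == "refuse" else 1.0,
        "validity": 1.0,  # assume well-formed trajectories here
    }

def repair_failure(failure, n_candidates=4):
    """Propose repairs on-policy and keep the best verifier-scored one."""
    candidates = [toy_policy(failure, i) for i in range(n_candidates)]
    scored = [(sum(toy_verifiers(c).values()), c) for c in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[0][1], scored[0][0]

failure = ["read_file", "parse", "unsafe_call"]
best, score = repair_failure(failure)
print(best, score)  # the safe_call repair wins over refusal (4.0 vs 2.3)
```

Note that the refusal candidate scores lower here precisely because of the over-refusal axis: refusing the task is "safe" but penalized, which is how trajectory-level scoring avoids collapsing safety into blanket refusal.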

Key facts

  • FATE is an on-policy self-evolving framework for agentic safety alignment.
  • It uses failure trajectories rather than only final responses.
  • Existing safety signals are response-level or off-policy.
  • FATE transforms verifier-scored failures into repair supervision without expert demonstrations.
  • Repair candidates are re-scored across security, utility, over-refusal control, and trajectory validity.
  • The framework aims to avoid safety-utility trade-offs.
  • Tool-using LLM agents may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks.
  • The approach uses dense trajectory-level information as its supervision signal.
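One way the four axes might gate which repaired trajectories become training data is a joint threshold filter, so that safety gains cannot come from refusing everything. The function and threshold values below are assumptions for illustration, not the paper's actual rule.

```python
# Hypothetical trajectory-level filter: a repaired trajectory is kept as
# supervision only if it clears every axis jointly. Thresholds are
# illustrative assumptions, not values from the paper.

def accept_for_training(scores, thresholds=None):
    """Accept only trajectories that are safe AND useful AND
    non-over-refusing AND structurally valid."""
    thresholds = thresholds or {
        "security": 0.9,
        "utility": 0.5,
        "over_refusal": 0.5,
        "validity": 1.0,
    }
    return all(scores[axis] >= t for axis, t in thresholds.items())

# Safe and useful: accepted.
print(accept_for_training(
    {"security": 1.0, "utility": 0.8, "over_refusal": 1.0, "validity": 1.0}))
# Safe only because it refuses: rejected by the over-refusal axis.
print(accept_for_training(
    {"security": 1.0, "utility": 0.2, "over_refusal": 0.0, "validity": 1.0}))
```

Requiring all axes at once, rather than summing them, is what makes the filter a control against the safety-utility trade-off: no axis can buy out another.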
