Agent-ToM: Theory-of-Mind Monitoring for LLM Agents

ai-technology · 2026-05-26

A new framework called Agent-ToM uses Theory-of-Mind reasoning to monitor autonomous LLM agents for covert malicious behavior. The approach infers agent beliefs, intentions, and goal alignment from full trajectory data, addressing limitations of prior methods that treat each trajectory independently. Agent-ToM is designed to detect delayed, context-dependent, and long-horizon attack patterns where agents pursue hidden objectives while appearing benign. The framework learns from prior monitoring experience, improving detection over time. The paper is available on arXiv under ID 2605.24216.

Key facts

Agent-ToM is a learning-to-monitor framework for LLM agents.
It uses Theory-of-Mind reasoning for security analysis.
It analyzes full trajectories to infer beliefs and intentions.
Prior methods treat each trajectory independently.
Agent-ToM learns from prior monitoring experience.
It targets covert malicious behavior, including delayed, context-dependent, and long-horizon attack patterns.
The paper is on arXiv: 2605.24216.
The approach aims to distinguish benign task execution from covert deviation toward hidden objectives.
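
The contrast between scoring each trajectory independently and carrying over prior monitoring experience can be illustrated with a toy sketch. This is not Agent-ToM's method: the summary does not describe how the framework represents beliefs, intentions, or experience, so the sketch contains no Theory-of-Mind reasoning at all. Every name here (Trajectory, stateless_score, ExperienceMonitor, the action strings, the 0.5 threshold and boost) is a hypothetical chosen for illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Trajectory:
    """Hypothetical record of one agent run: the task and the actions taken."""
    task: str
    actions: List[str]


def stateless_score(traj: Trajectory, allowed: Set[str]) -> float:
    """Score one trajectory in isolation: fraction of actions outside the allowed set."""
    if not traj.actions:
        return 0.0
    off_task = [a for a in traj.actions if a not in allowed]
    return len(off_task) / len(traj.actions)


@dataclass
class ExperienceMonitor:
    """Toy monitor that remembers off-task actions from confirmed incidents."""
    threshold: float = 0.5   # assumed cutoff for flagging
    boost: float = 0.5       # assumed penalty for a previously seen off-task action
    suspicious: Counter = field(default_factory=Counter)

    def score(self, traj: Trajectory, allowed: Set[str]) -> float:
        base = stateless_score(traj, allowed)
        # Raise the score if any off-task action appeared in an earlier incident.
        seen_before = any(
            self.suspicious[a] > 0 for a in traj.actions if a not in allowed
        )
        return min(1.0, base + (self.boost if seen_before else 0.0))

    def flag(self, traj: Trajectory, allowed: Set[str]) -> bool:
        return self.score(traj, allowed) >= self.threshold

    def record_incident(self, traj: Trajectory, allowed: Set[str]) -> None:
        """Store the off-task actions of a trajectory confirmed as malicious."""
        for a in traj.actions:
            if a not in allowed:
                self.suspicious[a] += 1


# Hypothetical task: summarize a file and write a report.
allowed = {"read_file", "summarize", "write_report"}
first = Trajectory("report", ["read_file", "summarize", "upload_external", "write_report"])
second = Trajectory("report", ["read_file", "read_file", "summarize", "write_report", "upload_external"])

monitor = ExperienceMonitor()
flag_before = monitor.flag(second, allowed)   # False: one stray action is below the cutoff
monitor.record_incident(first, allowed)       # assume a reviewer confirmed the first run as malicious
flag_after = monitor.flag(second, allowed)    # True: the same stray action is now known
```

In this toy setup the single off-task action in the second run stays below the threshold when the run is judged alone, and is flagged only once the monitor has recorded the earlier incident. A real monitor of the kind the summary describes would need a much richer representation of what the agent appears to believe and intend.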
