EnactToM Benchmark Tests Functional Theory of Mind in AI Agents

ai-technology · 2026-05-12

Researchers have introduced EnactToM, a benchmark that assesses functional Theory of Mind (ToM) in embodied AI agents. Whereas traditional benchmarks question agents directly about beliefs, EnactToM evaluates how well agents act on implicit beliefs in multi-agent settings. The benchmark comprises 300 tasks in a 3D household environment with partial observability, private information, and constrained communication. Each task is formally verified for solvability and for the epistemic depth it requires, and new tasks are generated to raise difficulty as models improve. On the hard split, all seven frontier models scored 0.0% Pass^3 on functional task completion while averaging 45.0% on literal belief probes. Analysis attributed 93% of failures to breakdowns in epistemic coordination, such as withholding information, which points to a marked gap between literal and functional ToM in current AI systems.
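The summary does not define Pass^3. A minimal sketch follows, assuming it has the meaning that Pass^k carries in some other agent benchmarks: a task counts as passed only if all k independent attempts succeed. The function name, task identifiers, and outcome data below are hypothetical illustrations, not EnactToM's actual scoring code or results.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k attempts ALL succeeded.

    Assumption: this mirrors the usual reading of Pass^k; the benchmark's
    own definition is not given in the summary.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved = 0
    for task_id, attempts in results.items():
        if len(attempts) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} attempts")
        # A single failed attempt among the k disqualifies the task.
        if all(attempts[:k]):
            solved += 1
    return solved / len(results)


# Hypothetical outcomes for four tasks, three attempts each.
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [True, True, True],
}
score = pass_hat_k(example, k=3)
print(f"Pass^3 = {score:.1%}")
```

Under this reading, a 0.0% Pass^3 means no task was solved in all three attempts; it does not by itself imply that no single attempt ever succeeded.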

Key facts

EnactToM is an evolving benchmark for functional Theory of Mind in embodied agents.
It consists of 300 multi-agent tasks in a 3D household environment.
Tasks involve partial observability, private information, and constrained communication.
All seven frontier models scored 0.0% Pass^3 on the hard split for functional task completion.
Models averaged 45.0% on literal belief probes.
93% of failures were due to epistemic coordination breakdowns.
New tasks are generated to increase difficulty as models improve.
Each task is formally verified for solvability and required epistemic depth.
