ARTFEED — Contemporary Art Intelligence

DialToM Benchmark Reveals LLMs Struggle with Social Forecasting

ai-technology · 2026-04-24

Researchers have developed DialToM, a human-verified benchmark built from authentic human conversations to assess Theory of Mind (ToM) in large language models (LLMs). The benchmark uses a multiple-choice format to evaluate both Literal ToM, which tests direct mental-state prediction, and Functional ToM, which tests the practical application of those states through Prospective Diagnostic Forecasting: given a mental-state profile, a model must identify the dialogue trajectory consistent with it. The findings reveal a notable reasoning gap: while LLMs reliably recognize mental states, most of them, Gemini 3 Pro being the exception, fail to use that knowledge to forecast how social interactions will unfold. The researchers also found only weak semantic similarity between human and LLM inferences. The DialToM dataset and evaluation code are publicly available for reproducibility. The paper is on arXiv under ID 2604.20443.
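The paper's exact protocol is not reproduced here, but a multiple-choice benchmark of this kind is typically scored as simple accuracy against a human-verified answer key. The sketch below assumes hypothetical item fields (`dialogue`, `choices`, `answer`) and uses a trivial always-first-choice baseline in place of a real model:

```python
from dataclasses import dataclass


@dataclass
class ToMItem:
    """One multiple-choice item: a dialogue context plus candidate options.

    For Literal ToM the choices would be mental-state descriptions; for
    Functional ToM they would be candidate dialogue trajectories.
    (Field names are illustrative, not taken from the DialToM release.)
    """
    dialogue: str
    choices: list
    answer: int  # index of the human-verified correct option


def accuracy(items, predict):
    """Fraction of items where the model's chosen index matches the key."""
    correct = sum(predict(item) == item.answer for item in items)
    return correct / len(items)


# Toy items for illustration only.
items = [
    ToMItem("A: I lost my keys again. B: Oh no...",
            ["A feels frustrated", "A feels elated"], 0),
    ToMItem("A: I got the job! B: ...",
            ["B will likely congratulate A", "B will change the subject"], 0),
]

baseline = lambda item: 0  # degenerate "model": always pick the first choice
print(accuracy(items, baseline))
```

In a real run, `predict` would wrap an LLM call that maps the dialogue and choices to a selected index; everything else stays the same.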

Key facts

  • DialToM is a human-verified benchmark built from natural human dialogue.
  • The benchmark uses a multiple-choice framework.
  • It evaluates Literal ToM (mental state prediction) and Functional ToM (forecasting social trajectories).
  • Prospective Diagnostic Forecasting tests models on identifying state-consistent dialogue trajectories.
  • Most LLMs, except Gemini 3 Pro, fail at forecasting social trajectories despite excelling at mental state identification.
  • Only weak semantic similarities exist between human and LLM-generated inferences.
  • The DialToM dataset and evaluation code are publicly available.
  • The paper is on arXiv (ID 2604.20443).

Entities

Institutions

  • arXiv

Sources