MentalMap Benchmark Reveals LLMs' Spatial Reasoning Cliff Across Languages

ai-technology · 2026-05-28

Researchers have introduced MentalMap, a multilingual benchmark for assessing the spatial world models of large language models (LLMs). The benchmark is organized into a six-tier capability hierarchy (L0-L5), ranging from atomic spatial facts to generative world-graph construction, and it probes four diagnostic dimensions: frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. Built from 100 ProcTHOR household scenes, MentalMap covers eight typologically diverse languages plus a structured-text control, with 39 task families across 1,950 evaluation cells. Testing thirteen LLMs of various scales and families, the researchers found a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning when baseline atomic accuracy exceeds 40%. The result challenges the claim that LLMs can form robust internal spatial models from text alone and points to limits on how well spatial reasoning transfers across languages.

Key facts

MentalMap is a multilingual diagnostic benchmark for spatial reasoning in LLMs.
It has a six-level capability hierarchy (L0-L5) from atomic facts to generative world-graph construction.
Four diagnostic axes: frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination.
Built from 100 ProcTHOR household scenes.
Covers eight typologically diverse languages plus a structured-text control.
Contains 39 task families across 1,950 evaluation cells.
Thirteen LLMs were evaluated across scales and model families.
A universal L3 reasoning cliff was identified: no model retains half of L0 performance on viewpoint reasoning when baseline atomic accuracy exceeds 40%.
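
The cliff criterion in the last key fact can be read as a simple ratio test. The Python sketch below is only an illustration of that reading, not the benchmark's own scoring code. It assumes that "retention" means viewpoint-reasoning accuracy divided by L0 atomic accuracy, and that viewpoint reasoning is the L3 tier. The names ModelScores, retention, and shows_cliff are hypothetical, and the scores in the usage lines are invented for illustration rather than taken from the paper. The source does not say whether the test is applied per model or per evaluation cell, so the sketch simply works on one pair of scores at a time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelScores:
    """Hypothetical container for one pair of accuracies, each in [0, 1]."""
    name: str
    l0_accuracy: float  # atomic-fact (L0) accuracy
    l3_accuracy: float  # viewpoint-reasoning accuracy (assumed to be the L3 tier)


def retention(scores: ModelScores) -> float:
    """Fraction of L0 accuracy kept on viewpoint reasoning (assumed definition)."""
    if scores.l0_accuracy <= 0:
        raise ValueError("L0 accuracy must be positive to compute retention")
    return scores.l3_accuracy / scores.l0_accuracy


def shows_cliff(
    scores: ModelScores,
    baseline_threshold: float = 0.40,   # "baseline atomic accuracy exceeds 40%"
    retention_threshold: float = 0.50,  # "retains half of L0 performance"
) -> Optional[bool]:
    """Return True if the cliff appears, False if not, None if the test does not apply."""
    if scores.l0_accuracy <= baseline_threshold:
        return None  # baseline too low; the criterion says nothing here
    return retention(scores) < retention_threshold


# Invented scores, for illustration only.
example = ModelScores("hypothetical-model", l0_accuracy=0.80, l3_accuracy=0.30)
print(example.name, round(retention(example), 3), shows_cliff(example))
```

Under this reading, a model with 80% atomic accuracy and 30% viewpoint accuracy keeps roughly 37.5% of its baseline and so falls below the one-half mark, while a model with a baseline at or below 40% is excluded from the test altogether.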

Sources

arXiv cs.AI — 2026-05-28