MentalMap Benchmark Reveals LLMs' Spatial Reasoning Cliff Across Languages
A recent research effort has introduced MentalMap, a multilingual diagnostic benchmark for assessing the spatial world models of large language models (LLMs). The benchmark is organized as a six-level capability hierarchy (L0-L5), running from atomic spatial facts to generative world-graph construction, and it probes four diagnostic axes: frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. Built from 100 ProcTHOR household scenes, MentalMap covers eight typologically diverse languages plus a structured-text control, with 39 task families spread across 1,950 evaluation cells. Testing thirteen LLMs of various scales and families, the researchers found a universal reasoning cliff at L3: among models whose baseline atomic (L0) accuracy exceeds 40%, none retains even half of its L0 performance on viewpoint reasoning. The study challenges the claim that LLMs can form robust internal spatial models from text alone and highlights the limits of cross-lingual transfer.
Key facts
- MentalMap is a multilingual diagnostic benchmark for spatial reasoning in LLMs.
- It has a six-level capability hierarchy (L0-L5) from atomic facts to generative world-graph construction.
- Four diagnostic axes: frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination.
- Built from 100 ProcTHOR household scenes.
- Covers eight typologically diverse languages plus a structured-text control.
- Contains 39 task families across 1,950 evaluation cells.
- Thirteen LLMs were evaluated across scales and model families.
- A universal L3 reasoning cliff was identified: no model retains half of L0 performance on viewpoint reasoning when baseline atomic accuracy exceeds 40%.
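The cliff criterion in the last bullet can be read as a simple retention ratio: L3 accuracy divided by L0 accuracy, checked only for models whose L0 accuracy is above 40%. The sketch below is one plausible reading of that criterion, not the benchmark's own scoring code. The function names, the constants, and the example accuracies are all hypothetical and are not taken from the study.

```python
from typing import Optional

# Thresholds as stated in the summary; the names are illustrative assumptions.
ATOMIC_FLOOR = 0.40     # L0 accuracy above which the criterion applies
RETENTION_LIMIT = 0.50  # "half of L0 performance"


def retention(l0_acc: float, l3_acc: float) -> float:
    """Fraction of a model's L0 accuracy that survives at L3."""
    if l0_acc <= 0:
        raise ValueError("L0 accuracy must be positive")
    return l3_acc / l0_acc


def shows_cliff(l0_acc: float, l3_acc: float) -> Optional[bool]:
    """Return True/False when the model clears the L0 floor, else None."""
    if l0_acc <= ATOMIC_FLOOR:
        return None  # criterion does not apply to this model
    return retention(l0_acc, l3_acc) < RETENTION_LIMIT


# Made-up accuracies for illustration only, not results from the benchmark.
examples = {
    "model_a": (0.80, 0.30),
    "model_b": (0.35, 0.30),
}
for name, (l0, l3) in examples.items():
    print(name, shows_cliff(l0, l3))
```

Returning `None` for models at or below the 40% floor keeps "criterion not applicable" distinct from "no cliff observed", which matters because the summary states the finding only for models above that baseline.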