MLLMs Face 'Cartesian Illusion' in Spatial Reasoning Tasks

ai-technology · 2026-05-20

A new paper on arXiv (2605.18194) identifies a fundamental limitation in Multi-Modal Large Language Models (MLLMs) that the authors call a 'Cartesian Illusion': the models answer spatial questions from text-based probability distributions rather than from a grounded understanding of 3D topology. The deficiency becomes critical in multi-agent environments that require second-order Theory of Mind (ToM), where an agent must infer another agent's beliefs from that agent's physical orientation and sensory limits. The authors probe these limits with a novel audio-visual task in which Agent A must predict Agent B's estimate of A's location. To address the limitation, they propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations and instead uses an Anchor mechanism. The research highlights that current MLLMs lack embodied spatial reasoning, which is essential for tasks like navigation and human-robot interaction.
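
To make the second-order task concrete, here is a minimal toy sketch in Python. It is not the paper's module or benchmark: the `Agent` class, the 90-degree field of view, and the "fall back to the last seen position" rule are all assumptions made for illustration, and the sketch models only a visual cone, not audio. The point it illustrates is that A's prediction has to pass through B's sensory limits instead of simply reporting A's own true coordinates.

```python
import math
from dataclasses import dataclass


@dataclass
class Agent:
    # Hypothetical 2D agent; fields are illustrative, not from the paper.
    x: float
    y: float
    heading_deg: float      # direction the agent faces, 0 = +x axis
    fov_deg: float = 90.0   # assumed total field of view


def bearing_deg(observer, target_xy):
    """Absolute bearing from the observer to a target point, in degrees."""
    dx = target_xy[0] - observer.x
    dy = target_xy[1] - observer.y
    return math.degrees(math.atan2(dy, dx))


def can_see(observer, target_xy):
    """True if the target lies inside the observer's field of view."""
    # Wrap the angular offset into [-180, 180) before comparing.
    offset = (bearing_deg(observer, target_xy) - observer.heading_deg + 180) % 360 - 180
    return abs(offset) <= observer.fov_deg / 2


def predict_b_estimate_of_a(a, b, last_seen_a):
    """A's second-order prediction: where does B believe A is?

    If B can currently see A, B's belief matches A's true position.
    Otherwise B's belief is assumed to stay at the last position where
    B saw A (None if B never saw A).
    """
    if can_see(b, (a.x, a.y)):
        return (a.x, a.y)
    return last_seen_a


# B stands at the origin facing along +x.
b = Agent(0.0, 0.0, heading_deg=0.0)
in_front = Agent(5.0, 1.0, heading_deg=180.0)
behind = Agent(-5.0, 0.0, heading_deg=0.0)

print(predict_b_estimate_of_a(in_front, b, last_seen_a=(2.0, 2.0)))
print(predict_b_estimate_of_a(behind, b, last_seen_a=(2.0, 2.0)))
```

In the second call A stands behind B, so the predicted belief is the stale last seen position and not A's actual coordinates. A model that answers with A's true location in that case is making the kind of error the article describes.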

Key facts

Paper arXiv:2605.18194 introduces the concept of 'Cartesian Illusion' in MLLMs.
MLLMs lack grounded 3D topological understanding for spatial reasoning.
The limitation is exposed in multi-agent environments requiring second-order Theory of Mind.
A novel audio-visual task requires Agent A to predict Agent B's estimation of A's location.
The proposed solution is an Epistemic Sensory Bottleneck module.
The module abandons rigid, rule-based coordinate transformations.
An Anchor mechanism is introduced as part of the solution.
The research focuses on embodied spatial intelligence in MLLMs.
