LLMs Fail to Grasp Embodied Spatial Language Across Cultures
A recent study published on arXiv (2604.25423) asks whether large language models (LLMs) grasp embodied cognition and cultural convention by probing their use of demonstratives, terms like 'this/that' in English and 'zhè/nà' in Chinese. The researchers collected 6,400 responses from 320 native speakers to establish a human baseline: English speakers reliably distinguish proximal from distal reference but struggle with perspective-taking, whereas Chinese speakers switch perspectives fluently while tolerating ambiguity in distal reference. Against this baseline, five state-of-the-art LLMs fail to capture the proximal-distal contrast, show no cultural differentiation, and default to English-centric reasoning. The study introduces a demonstrative-based task for evaluating embodied cognition and cultural conventions, exposing cross-cultural interpretive gaps and egocentric biases in current AI systems.
Key facts
- Study published on arXiv with ID 2604.25423
- Uses demonstratives (this/that, zhè/nà) as a probe for grounded knowledge
- 6,400 responses from 320 native speakers establish human baseline
- English speakers reliably distinguish proximal-distal but struggle with perspective-taking
- Chinese speakers switch perspectives fluently but tolerate distal ambiguity
- Five state-of-the-art LLMs fail to understand proximal-distal contrast
- LLMs show no cultural differences and default to English-centric reasoning
- New task introduced for evaluating embodied cognition and cultural conventions (see the illustrative sketch below)
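The summary does not spell out how the proposed task is constructed or scored, so the following is only a minimal sketch of what a demonstrative-choice probe could look like, assuming a forced-choice setup. The item list, prompt wording, `query_model` stub, and `always_proximal` stand-in are all illustrative assumptions, not the authors' released benchmark.

```python
# Illustrative sketch (not the paper's actual task): ask a model to choose a
# demonstrative for an object placed near or far from the speaker, then score
# agreement with the proximal/distal ground truth.

from typing import Callable

# Hypothetical probe items: (language, scene description, expected demonstrative)
ITEMS = [
    ("en", "A cup sits within the speaker's arm's reach. The speaker says: '___ cup is mine.'", "this"),
    ("en", "A cup sits across the room from the speaker. The speaker says: '___ cup is mine.'", "that"),
    ("zh", "A cup sits within the speaker's arm's reach. The speaker says: '___ 杯子是我的。'", "zhè"),
    ("zh", "A cup sits across the room from the speaker. The speaker says: '___ 杯子是我的。'", "nà"),
]

CHOICES = {"en": ("this", "that"), "zh": ("zhè", "nà")}


def score_proximal_distal(query_model: Callable[[str], str]) -> float:
    """Return the fraction of items where the model picks the expected demonstrative.

    `query_model` is a placeholder for whatever LLM call is available; it takes
    a prompt string and returns the model's raw text completion.
    """
    correct = 0
    for lang, scene, expected in ITEMS:
        near, far = CHOICES[lang]
        prompt = (
            f"{scene}\n"
            f"Fill in the blank with exactly one word: '{near}' or '{far}'."
        )
        answer = query_model(prompt).strip().lower()
        if expected in answer:
            correct += 1
    return correct / len(ITEMS)


if __name__ == "__main__":
    # Trivial stand-in model that always answers with the proximal form,
    # mimicking the egocentric, English-centric default the study reports.
    always_proximal = lambda prompt: "this"
    print(f"Proximal-distal accuracy: {score_proximal_distal(always_proximal):.2f}")
```

A fuller probe along the study's lines would also need perspective-shift items (speaker versus addressee viewpoint) to capture the cultural differences reported for Chinese speakers; this toy scorer does not attempt that.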
Entities
Institutions
- arXiv