LLMs Fail to Grasp Embodied Spatial Language Across Cultures
A recent study published on arXiv (2604.25423) asks whether large language models (LLMs) grasp embodied cognition and cultural convention by probing their use of demonstratives, terms like 'this/that' in English and 'zhè/nà' in Chinese. The researchers collected 6,400 responses from 320 native speakers to establish a human baseline: English speakers reliably distinguish proximal from distal reference but struggle with perspective-taking, whereas Chinese speakers switch perspectives fluently while tolerating ambiguity in distal reference. Against this baseline, five state-of-the-art LLMs fail to capture the proximal-distal contrast, show no cultural differentiation, and default to English-centric reasoning. The study introduces a demonstrative-based task for evaluating embodied cognition and cultural conventions, exposing cross-cultural interpretive gaps and egocentric biases in current AI systems.
Key facts
- Study published on arXiv with ID 2604.25423
- Uses demonstratives (this/that, zhè/nà) as a probe for grounded knowledge
- 6,400 responses from 320 native speakers establish human baseline
- English speakers reliably distinguish proximal-distal but struggle with perspective-taking
- Chinese speakers switch perspectives fluently but tolerate distal ambiguity
- Five state-of-the-art LLMs fail to understand proximal-distal contrast
- LLMs show no cultural differences and default to English-centric reasoning
- New task introduced for evaluating embodied cognition and cultural conventions (see the illustrative sketch below)
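The summary does not spell out how the proposed task is constructed or scored, so the following is only a minimal sketch of what a demonstrative-choice probe could look like, assuming a forced-choice setup. The item list, prompt wording, `query_model` stub, and `always_proximal` stand-in are all illustrative assumptions, not the authors' released benchmark.

```python
# Illustrative sketch (not the paper's actual task): ask a model to choose a
# demonstrative for an object placed near or far from the speaker, then score
# agreement with the proximal/distal ground truth.

from typing import Callable

# Hypothetical probe items: (language, scene description, expected demonstrative)
ITEMS = [
    ("en", "A cup sits within the speaker's arm's reach. The speaker says: '___ cup is mine.'", "this"),
    ("en", "A cup sits across the room from the speaker. The speaker says: '___ cup is mine.'", "that"),
    ("zh", "A cup sits within the speaker's arm's reach. The speaker says: '___ 杯子是我的。'", "zhè"),
    ("zh", "A cup sits across the room from the speaker. The speaker says: '___ 杯子是我的。'", "nà"),
]

CHOICES = {"en": ("this", "that"), "zh": ("zhè", "nà")}


def score_proximal_distal(query_model: Callable[[str], str]) -> float:
    """Return the fraction of items where the model picks the expected demonstrative.

    `query_model` is a placeholder for whatever LLM call is available; it takes
    a prompt string and returns the model's raw text completion.
    """
    correct = 0
    for lang, scene, expected in ITEMS:
        near, far = CHOICES[lang]
        prompt = (
            f"{scene}\n"
            f"Fill in the blank with exactly one word: '{near}' or '{far}'."
        )
        answer = query_model(prompt).strip().lower()
        if expected in answer:
            correct += 1
    return correct / len(ITEMS)


if __name__ == "__main__":
    # Trivial stand-in model that always answers with the proximal form,
    # mimicking the egocentric, English-centric default the study reports.
    always_proximal = lambda prompt: "this"
    print(f"Proximal-distal accuracy: {score_proximal_distal(always_proximal):.2f}")
```

A fuller probe along the study's lines would also need perspective-shift items (speaker versus addressee viewpoint) to capture the cultural differences reported for Chinese speakers; this toy scorer does not attempt that.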
Entities
Institutions
- arXiv