ARTFEED — Contemporary Art Intelligence

LLMs Fail to Grasp Embodied Spatial Language Across Cultures

publication · 2026-04-30

A recent study posted on arXiv (2604.25423) examines whether large language models (LLMs) grasp embodied cognition and cultural variation through demonstratives: terms such as "this/that" in English and "zhè/nà" in Chinese. The researchers collected 6,400 responses from 320 native speakers to establish a human baseline: English speakers reliably distinguish proximal from distal reference but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate some distal ambiguity. In contrast, five state-of-the-art LLMs fail to internalize the proximal-distal distinction and show no cultural sensitivity, defaulting instead to English-centric reasoning. The study introduces a novel demonstrative-based task for assessing embodied cognition and cultural conventions, highlighting cross-cultural interpretive disparities and egocentric biases in AI.

Key facts

  • Study published on arXiv with ID 2604.25423
  • Uses demonstratives (this/that, zhè/nà) as a probe for grounded knowledge
  • 6,400 responses from 320 native speakers establish human baseline
  • English speakers reliably distinguish proximal-distal but struggle with perspective-taking
  • Chinese speakers switch perspectives fluently but tolerate distal ambiguity
  • Five state-of-the-art LLMs fail to understand proximal-distal contrast
  • LLMs show no sensitivity to cultural differences and default to English-centric reasoning
  • New task introduced for evaluating embodied cognition and cultural conventions

Entities

Institutions

  • arXiv
