ARTFEED — Contemporary Art Intelligence

SocialGrid Benchmark Reveals LLMs Struggle with Social Reasoning in Multi-Agent Environments

ai-technology · 2026-04-20

A new benchmark, SocialGrid, evaluates Large Language Models (LLMs) as autonomous agents in embodied multi-agent environments, and the results reveal notable gaps in their social reasoning and planning. Inspired by the game Among Us, the environment scores LLM agents on planning, task execution, and social reasoning. Even the strongest open model tested, GPT-OSS-120B, achieves less than 60% accuracy on task execution and planning, with agents frequently repeating actions or failing to navigate basic obstacles. Because weak navigation can skew measurements of social intelligence, SocialGrid includes an optional Planning Oracle that supplies correct plans, isolating social reasoning from planning deficits. With oracle assistance, task completion improves, but social reasoning does not: agents detect deception no better than random chance, relying instead on superficial heuristics. The authors (arXiv:2604.16022v1) argue that these results underscore the urgent need to assess LLMs' social intelligence as they evolve from text processors into autonomous agents in multi-agent settings.
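The article does not describe SocialGrid's actual interface, but the role of a Planning Oracle can be sketched abstractly: the oracle hands the agent a correct action sequence, so any remaining failures can be attributed to reasoning rather than navigation. A minimal toy sketch, in which every name (`GridEnv`, `oracle_plan`, the move vocabulary) is an assumption invented for illustration:

```python
# Illustrative sketch only: SocialGrid's real API is not shown in the article.
# GridEnv, oracle_plan, and run_episode are hypothetical names.
from dataclasses import dataclass

@dataclass
class GridEnv:
    """Toy 3x3 grid: the agent must reach a task cell while avoiding a wall."""
    agent: tuple = (0, 0)
    goal: tuple = (2, 2)
    walls: frozenset = frozenset({(1, 1)})

    def step(self, move):
        dr, dc = {"up": (-1, 0), "down": (1, 0),
                  "left": (0, -1), "right": (0, 1)}[move]
        nxt = (self.agent[0] + dr, self.agent[1] + dc)
        # Blocked or out-of-bounds moves leave the agent in place
        # (mirroring how LLM agents can get stuck on basic obstacles).
        if nxt not in self.walls and all(0 <= c <= 2 for c in nxt):
            self.agent = nxt
        return self.agent == self.goal

def oracle_plan(env):
    """Stand-in Planning Oracle: emits a correct route for this fixed layout,
    so task failures under the oracle cannot be blamed on navigation."""
    r, c = env.agent
    gr, gc = env.goal
    # Go right first, then down, which skirts the (1, 1) wall in this layout.
    return ["right"] * (gc - c) + ["down"] * (gr - r)

def run_episode(env, planner):
    done = False
    for move in planner(env):
        done = env.step(move)
    return done
```

Replacing `oracle_plan` with an LLM-generated plan and comparing completion rates is the kind of ablation the benchmark's oracle condition enables.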

Key facts

  • SocialGrid is an embodied multi-agent environment benchmark for evaluating LLMs
  • Inspired by the game Among Us
  • Evaluates LLM agents on planning, task execution, and social reasoning
  • GPT-OSS-120B achieves below 60% accuracy in task completion and planning
  • Agents get stuck in repetitive behaviors or fail to navigate basic obstacles
  • SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits
  • Agents detect deception at only near-random accuracy, regardless of model scale
  • Research published on arXiv with identifier arXiv:2604.16022v1
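The claim that deception detection is "no better than random chance" is typically checked with a statistical test against a chance baseline. A minimal sketch using an exact one-sided binomial test; the 50% baseline and the example counts are assumptions for illustration, not figures from the paper:

```python
# Exact one-sided binomial test against a chance baseline (assumed 50%).
from math import comb

def binom_p_value(correct, trials, chance=0.5):
    """Probability of seeing at least `correct` successes in `trials`
    attempts if the agent's true accuracy were only `chance`."""
    return sum(comb(trials, k) * chance**k * (1 - chance)**(trials - k)
               for k in range(correct, trials + 1))

# Hypothetical example: 54 correct deception calls out of 100.
p = binom_p_value(54, 100)
# A large p-value means the accuracy is indistinguishable from guessing.
```

A result like 54/100 yields a p-value well above 0.05, i.e. statistically indistinguishable from coin-flipping, which is the shape of the failure the benchmark reports.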

Entities

Institutions

  • arXiv

Sources