ARTFEED — Contemporary Art Intelligence

LLMs Suppress Nash Play via Prosocial Override in Final Layers

ai-technology · 2026-05-01

A team of researchers from an undisclosed institution has identified the neural mechanism behind large language models' divergence from Nash equilibrium in strategic games. Probing Llama-3 and Qwen2.5 models (8B to 72B parameters), they found that opponent history is decodable from the first layer with 96% accuracy, while Nash-action encoding stays below 56% across all 32 layers of Llama-3-8B. Although the model favors Nash actions through most of the forward pass, a prosocial override in the final layers reverses that preference, producing non-equilibrium behavior. The work offers both mechanistic insight and causal control: intervening on the final-layer override restores equilibrium play, with implications for strategic decision-making in LLMs, AI alignment, and game theory.
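The layer-wise probing the study describes can be sketched roughly as follows. This is not the authors' code: it fits a simple linear probe on synthetic activations that stand in for real per-layer hidden states, contrasting a layer with a strong linear encoding of opponent history against one with a weak encoding. All data and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_accuracy(X, y, n_test=200):
    """Fit a ridge-regularised linear probe and report held-out accuracy."""
    Xtr, ytr = X[:-n_test], y[:-n_test]
    Xte, yte = X[-n_test:], y[-n_test:]
    Y = np.eye(2)[ytr]  # one-hot targets
    W = np.linalg.solve(Xtr.T @ Xtr + 1e-3 * np.eye(X.shape[1]), Xtr.T @ Y)
    pred = (Xte @ W).argmax(axis=1)
    return (pred == yte).mean()

d, n = 64, 1000
y = rng.integers(0, 2, n)                          # opponent's last action
signal = np.outer(2 * y - 1, rng.normal(size=d))   # linearly encoded feature

# "Early layer": strong linear encoding -> probe decodes almost perfectly
early = signal + 0.3 * rng.normal(size=(n, d))
# "Weakly encoding layer": faint signal in noise -> much lower accuracy
weak = 0.05 * signal + rng.normal(size=(n, d))

print(f"early-layer probe acc: {probe_accuracy(early, y):.2f}")
print(f"weak-layer probe acc:  {probe_accuracy(weak, y):.2f}")
```

Run per layer on real hidden states, a curve of such accuracies is what distinguishes the 96% opponent-history encoding from the sub-56% Nash-action encoding reported in the paper.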

Key facts

  • Study examines Llama-3 and Qwen2.5 models (8B to 72B parameters)
  • Four canonical two-player games used in self-play and cross-play experiments
  • Opponent history encoded with 96% probe accuracy at first layer of Llama-3-8B
  • Nash action encoding never exceeds 56% across all layers
  • No dedicated Nash module found in the model
  • Prosocial override in final layers reverses private Nash preference
  • Deviation can be reversed through causal intervention on final layers
  • Paper published on arXiv with ID 2604.27167
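The final-layer intervention in the last two facts can be illustrated with a toy residual-stream model. This is a hypothetical sketch, not the paper's method: a mid-stack state that prefers the Nash action, plus a final-layer "prosocial" direction that overrides it; zeroing that direction restores equilibrium play. All vectors and names are made up for illustration.

```python
import numpy as np

ACTIONS = ["defect (Nash)", "cooperate"]
W_out = np.eye(2)                   # readout: logits = W_out @ residual

residual = np.array([2.0, 1.0])     # mid-stack state: Nash action preferred
prosocial = np.array([-1.5, 1.5])   # hypothetical final-layer override

def choose(resid, override_scale=1.0):
    """Pick an action; override_scale=0.0 ablates the final-layer override."""
    logits = W_out @ (resid + override_scale * prosocial)
    return ACTIONS[int(np.argmax(logits))]

print(choose(residual))                      # override on  -> "cooperate"
print(choose(residual, override_scale=0.0))  # override off -> "defect (Nash)"
```

The causal claim in the study has the same shape: the private Nash preference is already present before the last layers, so suppressing the override there is sufficient to reverse the deviation.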

Entities

Institutions

  • arXiv

Sources