LLMs Suppress Nash Play via Prosocial Override in Final Layers
A team of researchers from an undisclosed institution reports a neural mechanism behind large language models' divergence from Nash equilibrium in strategic games. Probing Llama-3 and Qwen2.5 models ranging from 8B to 72B parameters, they find that opponent history is decodable with 96% probe accuracy at the very first layer of Llama-3-8B, while Nash-action probe accuracy stays below 56% across all 32 of its layers. The model privately favors Nash actions through most of the forward pass, but a prosocial override in the final layers reverses this preference, producing non-equilibrium behavior. The work offers both mechanistic insight and causal control: intervening on the final-layer override reverses the deviation, with implications for strategic decision-making in LLMs, AI alignment, and game theory.
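The layer-wise probing result suggests a straightforward replication recipe. Below is a minimal sketch, assuming hidden-state access through Hugging Face transformers and logistic-regression probes; the checkpoint name, prompt format, and labels are illustrative stand-ins, not the paper's actual setup.

```python
# Sketch: layer-wise linear probing for opponent-history information.
# Assumptions (not from the paper): transformers access to Llama-3-8B
# hidden states, a toy repeated-game prompt format, and logistic-regression
# probes; the paper's prompts, labels, and probe family may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def hidden_states_for(prompt: str) -> torch.Tensor:
    """Return (num_layers + 1, hidden_dim) activations at the final token."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: one (1, seq_len, hidden) tensor per layer,
    # including the embedding layer
    return torch.stack([h[0, -1] for h in out.hidden_states])

# Hypothetical dataset: prompts describing play so far, labeled by the
# opponent's last move (0 = cooperate, 1 = defect).
prompts, labels = [], []
for opp_move in ("cooperated", "defected"):
    for rnd in range(1, 51):
        prompts.append(
            f"Round {rnd} of a repeated game. Last round your opponent "
            f"{opp_move}. Choose your next action:"
        )
        labels.append(0 if opp_move == "cooperated" else 1)

acts = torch.stack([hidden_states_for(p) for p in prompts])  # (N, L+1, D)

# One linear probe per layer: accuracy traces where the feature is encoded.
for layer in range(acts.shape[1]):
    X = acts[:, layer].float().cpu().numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.3, random_state=0, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: probe accuracy {probe.score(X_te, y_te):.2f}")
```

A near-ceiling accuracy at early layers with a flat, low Nash-action trace would mirror the reported pattern: history is represented immediately, while no layer carries a cleanly decodable Nash action.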
Key facts
- Study examines Llama-3 and Qwen2.5 models (8B to 72B parameters)
- Four canonical two-player games used in self-play and cross-play experiments
- Opponent history decoded with 96% probe accuracy at the first layer of Llama-3-8B
- Nash-action probe accuracy never exceeds 56% at any of the model's 32 layers
- No dedicated Nash module found in the model
- Prosocial override in final layers reverses private Nash preference
- Deviation can be reversed through causal intervention on the final layers (see the sketch after this list)
- Paper published on arXiv with ID 2604.27167
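The causal-intervention claim can be made concrete with activation steering, one plausible form of final-layer intervention. The sketch below (reusing `model` and `tok` from the probing sketch above) removes the projection onto a hypothetical "prosocial" direction from the last few decoder layers; the direction, the choice of layers, and the hook mechanics are assumptions of this sketch, not the paper's published method.

```python
# Sketch: a final-layer activation intervention, in the spirit of the
# paper's causal test. Assumptions (not the paper's): the prosocial
# override is approximated by a single steering direction obtained
# elsewhere (e.g. a probe weight vector), and ablating it from the last
# few decoder layers shifts play back toward the Nash action.
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that removes the component along `direction`."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        # Decoder layers may return a tensor or a tuple led by hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ d).unsqueeze(-1) * d   # projection onto direction
        steered = hidden - proj
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# `prosocial_dir` is a hypothetical (hidden_dim,) vector, e.g. the weight
# vector of a probe trained to detect the override; a placeholder here.
prosocial_dir = torch.randn(
    model.config.hidden_size, device=model.device, dtype=model.dtype
)

handles = [
    layer.register_forward_hook(make_ablation_hook(prosocial_dir))
    for layer in model.model.layers[-4:]        # intervene on last 4 layers
]

prompt = "You play one round of Prisoner's Dilemma. Your action:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=5)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()                                  # restore the original model
```

Comparing action distributions with and without the hooks, across the four games, is one way to test whether the final-layer override is causally responsible for the deviation from equilibrium play.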