LLMs Suppress Nash Play via Prosocial Override in Final Layers
A team of researchers from an undisclosed institution reports a neural mechanism behind large language models' divergence from Nash equilibrium in strategic games. Probing Llama-3 and Qwen2.5 models ranging from 8B to 72B parameters, they find that opponent history is decodable with 96% probe accuracy at the very first layer of Llama-3-8B, while Nash-action probe accuracy stays below 56% across all 32 of its layers. The model privately favors Nash actions through most of the forward pass, but a prosocial override in the final layers reverses this preference, producing non-equilibrium behavior. The work offers both mechanistic insight and causal control: intervening on the final-layer override reverses the deviation, with implications for strategic decision-making in LLMs, AI alignment, and game theory.
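The layer-wise probing result suggests a straightforward replication recipe. Below is a minimal sketch, assuming hidden-state access through Hugging Face transformers and logistic-regression probes; the checkpoint name, prompt format, and labels are illustrative stand-ins, not the paper's actual setup.

```python
# Sketch: layer-wise linear probing for opponent-history information.
# Assumptions (not from the paper): transformers access to Llama-3-8B
# hidden states, a toy repeated-game prompt format, and logistic-regression
# probes; the paper's prompts, labels, and probe family may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def hidden_states_for(prompt: str) -> torch.Tensor:
    """Return (num_layers + 1, hidden_dim) activations at the final token."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: one (1, seq_len, hidden) tensor per layer,
    # including the embedding layer
    return torch.stack([h[0, -1] for h in out.hidden_states])

# Hypothetical dataset: prompts describing play so far, labeled by the
# opponent's last move (0 = cooperate, 1 = defect).
prompts, labels = [], []
for opp_move in ("cooperated", "defected"):
    for rnd in range(1, 51):
        prompts.append(
            f"Round {rnd} of a repeated game. Last round your opponent "
            f"{opp_move}. Choose your next action:"
        )
        labels.append(0 if opp_move == "cooperated" else 1)

acts = torch.stack([hidden_states_for(p) for p in prompts])  # (N, L+1, D)

# One linear probe per layer: accuracy traces where the feature is encoded.
for layer in range(acts.shape[1]):
    X = acts[:, layer].float().cpu().numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.3, random_state=0, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: probe accuracy {probe.score(X_te, y_te):.2f}")
```

A near-ceiling accuracy at early layers with a flat, low Nash-action trace would mirror the reported pattern: history is represented immediately, while no layer carries a cleanly decodable Nash action.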
Key facts
- Study examines Llama-3 and Qwen2.5 models (8B to 72B parameters)
- Four canonical two-player games used in self-play and cross-play experiments
- Opponent history decoded with 96% probe accuracy at the first layer of Llama-3-8B
- Nash-action probe accuracy never exceeds 56% at any of the model's 32 layers
- No dedicated Nash module found in the model
- Prosocial override in final layers reverses private Nash preference
- Deviation can be reversed through causal intervention on the final layers (see the sketch after this list)
- Paper published on arXiv with ID 2604.27167
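The causal-intervention claim can be made concrete with activation steering, one plausible form of final-layer intervention. The sketch below (reusing `model` and `tok` from the probing sketch above) removes the projection onto a hypothetical "prosocial" direction from the last few decoder layers; the direction, the choice of layers, and the hook mechanics are assumptions of this sketch, not the paper's published method.

```python
# Sketch: a final-layer activation intervention, in the spirit of the
# paper's causal test. Assumptions (not the paper's): the prosocial
# override is approximated by a single steering direction obtained
# elsewhere (e.g. a probe weight vector), and ablating it from the last
# few decoder layers shifts play back toward the Nash action.
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that removes the component along `direction`."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        # Decoder layers may return a tensor or a tuple led by hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ d).unsqueeze(-1) * d   # projection onto direction
        steered = hidden - proj
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# `prosocial_dir` is a hypothetical (hidden_dim,) vector, e.g. the weight
# vector of a probe trained to detect the override; a placeholder here.
prosocial_dir = torch.randn(
    model.config.hidden_size, device=model.device, dtype=model.dtype
)

handles = [
    layer.register_forward_hook(make_ablation_hook(prosocial_dir))
    for layer in model.model.layers[-4:]        # intervene on last 4 layers
]

prompt = "You play one round of Prisoner's Dilemma. Your action:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=5)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()                                  # restore the original model
```

Comparing action distributions with and without the hooks, across the four games, is one way to test whether the final-layer override is causally responsible for the deviation from equilibrium play.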