ChessArena Testbed Reveals LLMs Lack Strategic Reasoning
A new research paper introduces ChessArena, a chess-based testbed designed to evaluate whether large language models (LLMs) possess genuine strategic reasoning or merely excel at pattern recognition. The framework pits LLMs against each other across four play modes and tests basic understanding, move selection, and puzzle solving. Across more than 800 games involving 13 LLMs, the results show significant shortcomings: no model beats Maia-1100, a chess engine calibrated to human amateur-level play, and some models even lose to random play. The study also presents a strong baseline: a fine-tuned Qwen3-8B model substantially improves performance, approaching much larger state-of-the-art reasoning models. The paper was submitted to arXiv on September 25, 2025.
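Concretely, a play-mode evaluation of this kind amounts to a game loop in which a model's moves are checked for legality and the final result is recorded. Below is a minimal sketch, assuming the python-chess library; query_llm_for_move is a hypothetical stand-in for an actual model call, not the paper's harness.

```python
# A minimal sketch of a play-mode evaluation loop, assuming the
# python-chess package. query_llm_for_move is a hypothetical stand-in
# for a real model call; it is not code from the ChessArena paper.
import random
import chess

def query_llm_for_move(board: chess.Board) -> chess.Move:
    """Hypothetical LLM move picker: a real harness would prompt the
    model with the position (e.g., as FEN) and parse its reply."""
    return random.choice(list(board.legal_moves))  # placeholder policy

def play_game() -> str:
    """Play one game: the 'LLM' takes White, a uniform-random baseline
    takes Black. An illegal move counts as an immediate loss, which is
    one way rule adherence can be scored."""
    board = chess.Board()
    while not board.is_game_over():
        if board.turn == chess.WHITE:
            move = query_llm_for_move(board)
            if move not in board.legal_moves:  # rule-adherence check
                return "0-1 (illegal move by LLM)"
        else:
            move = random.choice(list(board.legal_moves))
        board.push(move)
    return board.result()  # "1-0", "0-1", or "1/2-1/2"

print(play_game())
```

With the placeholder policy, this is exactly the random-play baseline that some evaluated models reportedly lose to.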
Key facts
- ChessArena is a testbed for evaluating strategic reasoning in LLMs.
- It uses chess to test reasoning, rule adherence, and game state tracking (see the state-tracking sketch after this list).
- 13 LLMs were evaluated in more than 800 games.
- No model beat Maia-1100, a human amateur-level engine.
- Some models lost to random play.
- A fine-tuned Qwen3-8B model showed substantial improvement.
- The paper is available on arXiv (2509.24239).
- The study questions whether LLMs have genuine strategic reasoning.
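One of the criteria above, game state tracking, can be probed directly: replay a move sequence and check whether a model reconstructs the resulting position. A minimal sketch under the same python-chess assumption follows; ask-the-model-for-a-FEN is the hypothetical step, not code from the paper.

```python
# Sketch of a state-tracking probe, assuming python-chess.
# The model's FEN answer would come from a prompt; only the
# ground-truth side is implemented here.
import chess

def ground_truth_fen(san_moves: list[str]) -> str:
    """Replay a SAN move list and return the resulting FEN string."""
    board = chess.Board()
    for san in san_moves:
        board.push_san(san)  # raises ValueError on an illegal move
    return board.fen()

def tracking_correct(model_fen: str, san_moves: list[str]) -> bool:
    """Score one probe: does the model's claimed FEN match ground truth?
    Comparing only the piece-placement field would be a looser variant."""
    return model_fen == ground_truth_fen(san_moves)

moves = ["e4", "e5", "Nf3", "Nc6", "Bb5"]  # Ruy Lopez opening
print(ground_truth_fen(moves))
```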
Entities
Systems and models
- ChessArena (evaluation testbed)
- Maia-1100 (chess engine)
- Qwen3-8B (language model)
Platforms
- arXiv (preprint repository)