GENSTRAT: Procedurally Generated Games Test LLM Strategic Reasoning
Researchers have introduced GENSTRAT, a benchmark that assesses the strategic reasoning of large language models (LLMs) on procedurally generated two-player zero-sum imperfect-information card games. Because the generator can draw fresh games on demand, the evaluation stays evergreen and resists data contamination. The framework pairs this game distribution with a capability-profile methodology that decomposes model competence across six axes, including state space and temporal reasoning. The approach addresses two limitations of fixed canonical game benchmarks: they may become ineffective as models advance, and they transfer poorly to real-world strategic settings, where LLMs are increasingly deployed as economic agents in marketplaces, auctions, and bidding environments.
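To make the idea of drawing fresh games on demand concrete, here is a minimal Python sketch of a seeded game generator. It is not GENSTRAT's actual generator or schema: the `GameSpec` fields, the parameter ranges, and the `draw_game` and `deal` helpers are all hypothetical, chosen only to show how a seed can map to a reproducible yet previously unseen two-player card game.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GameSpec:
    # Hypothetical parameters for illustration; not GENSTRAT's real schema.
    num_ranks: int
    num_suits: int
    hand_size: int
    num_rounds: int


def draw_game(seed: int) -> GameSpec:
    """Sample one game specification; the same seed always yields the same game."""
    rng = random.Random(seed)
    num_ranks = rng.randint(3, 13)
    num_suits = rng.randint(1, 4)
    deck_size = num_ranks * num_suits
    # Two players, so both private hands must fit inside the deck.
    hand_size = rng.randint(1, min(5, deck_size // 2))
    num_rounds = rng.randint(1, 4)
    return GameSpec(num_ranks, num_suits, hand_size, num_rounds)


def deal(spec: GameSpec, seed: int):
    """Deal two private hands; each player sees only their own (imperfect information)."""
    rng = random.Random(seed)
    deck = [(r, s) for r in range(spec.num_ranks) for s in range(spec.num_suits)]
    rng.shuffle(deck)
    h = spec.hand_size
    return deck[:h], deck[h:2 * h]


if __name__ == "__main__":
    spec = draw_game(seed=2024)
    hand_a, hand_b = deal(spec, seed=1)
    print(spec)
    print(hand_a, hand_b)
```

Under these assumptions, an evaluator could draw games from seeds chosen at evaluation time, so the specific games cannot have appeared in a model's training data, while any reported result stays reproducible from its seed.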
Key facts
- GENSTRAT uses procedurally generated strategic environments
- Evaluates LLMs on two-player zero-sum imperfect-information card games
- Generator can draw fresh games on demand for evergreen evaluation
- Resistant to contamination
- Capability-profile methodology decomposes competence across six axes
- Addresses limitations of fixed canonical game benchmarks
- LLMs are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings
- arXiv:2605.23238v1
Entities
Institutions
- arXiv