GENSTRAT: Procedurally Generated Games Test LLM Strategic Reasoning

ai-technology · 2026-05-25

Researchers have introduced GENSTRAT, a benchmark that assesses the strategic reasoning of large language models (LLMs) using procedurally generated two-player, zero-sum, imperfect-information card games. Because its generator can draw fresh games on demand, evaluation stays evergreen and resists data contamination. The framework pairs this distribution of games with a capability-profile methodology that decomposes model competence along six axes, including state space and temporal reasoning. The approach targets two limitations of fixed canonical game benchmarks: they may lose their usefulness as models improve, and they may not carry over to real-world strategic settings, where LLMs are increasingly deployed as economic agents in marketplaces, auctions, and bidding environments.
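To illustrate the general idea of drawing fresh games on demand, here is a minimal Python sketch of a seeded game generator. It is not GENSTRAT's actual generator: the `GameSpec` fields, the parameter ranges, and the `sample_game` and `deal` helpers are all hypothetical, chosen only to show how a seed can map to a reproducible game definition with private (imperfect-information) hands.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GameSpec:
    """Hypothetical parameters of one generated card game (not from the paper)."""
    num_ranks: int
    num_suits: int
    hand_size: int
    num_rounds: int


def sample_game(seed: int) -> GameSpec:
    """Draw one game definition; the same seed always yields the same game."""
    rng = random.Random(seed)
    num_ranks = rng.randint(3, 13)
    num_suits = rng.randint(1, 4)
    deck_size = num_ranks * num_suits
    # Keep hands small so two private hands always fit in the deck.
    hand_size = rng.randint(1, max(1, deck_size // 4))
    num_rounds = rng.randint(1, hand_size)
    return GameSpec(num_ranks, num_suits, hand_size, num_rounds)


def deal(spec: GameSpec, seed: int):
    """Deal two disjoint private hands; each player sees only their own."""
    rng = random.Random(seed)
    deck = [(r, s) for r in range(spec.num_ranks) for s in range(spec.num_suits)]
    rng.shuffle(deck)
    h = spec.hand_size
    return deck[:h], deck[h:2 * h]


if __name__ == "__main__":
    # A new seed gives a game that is unlikely to appear in any training corpus.
    spec = sample_game(seed=2026)
    hand_a, hand_b = deal(spec, seed=1)
    print(spec, hand_a, hand_b)
```

Under these assumptions, an evaluator could keep a stream of unused seeds and sample a new batch of games for each evaluation run, so that no fixed game set is ever reused.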

Key facts

GENSTRAT uses procedurally generated strategic environments
Evaluates LLMs on two-player zero-sum imperfect-information card games
Generator can draw fresh games on demand for evergreen evaluation
Resistant to data contamination, since games are generated fresh rather than fixed
Capability-profile methodology decomposes competence across six axes
Addresses limitations of fixed canonical game benchmarks
LLMs are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings
arXiv:2605.23238v1
