Large Language Models as Strategic Agents in Timed Risk Game

ai-technology · 2026-05-23

A recent arXiv study (2605.22238) evaluates large language models as live strategic agents in a timed, multi-phase Risk setting with explicit victory objectives and repeated planning-execution cycles. In a 32-game championship played under fixed rules, Gemini 3.1 Pro Preview won 20 games against GPT-5.1, Claude Opus 4-7, and Kimi K2.6, a winner distribution that differs significantly from an equal-strength null (p ≈ 1.5 × 10⁻⁵). When execution was standardized on a cheaper Gemini Flash scaffold, a pooled bakeoff of 32 planners showed near-equal performance (p ≈ 0.821), suggesting that the original spread between providers came from end-to-end system behavior rather than from planning alone. The study highlights operational gaps in timed Risk play and emphasizes the need for hybrid decomposition when assessing LLM effectiveness in dynamic environments.

Key facts

Study evaluates LLMs as live strategic agents in a timed Risk game
Gemini 3.1 Pro Preview won 20 of 32 games against GPT-5.1, Claude Opus 4-7, Kimi K2.6
Winner distribution p ≈ 1.5 × 10⁻⁵ under equal-strength null
Planning separated from execution using Gemini Flash scaffold
Planner bakeoff showed near-equality (p ≈ 0.821)
Provider spread attributed to end-to-end system behavior
Research from arXiv paper 2605.22238
Focus on operational gaps in timed Risk play
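
To give a feel for how unlikely 20 wins in 32 games would be if all agents were equally strong, here is a minimal sketch of a one-sided binomial tail calculation. It assumes four agents per game, each with a 1/4 chance of winning under the null. This is not necessarily the test used in the paper: the reported p ≈ 1.5 × 10⁻⁵ presumably reflects the full winner distribution across all four agents, which this summary does not give, so the number printed here need not match it. The function name `binomial_tail` and the constants are illustrative.

```python
from math import comb


def binomial_tail(wins: int, games: int, p: float) -> float:
    """Return P(X >= wins) for X ~ Binomial(games, p)."""
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(wins, games + 1)
    )


GAMES = 32        # championship size reported in the summary
TOP_WINS = 20     # wins reported for Gemini 3.1 Pro Preview
PLAYERS = 4       # assumption: all four agents compete in every game

# Under an equal-strength null, each agent wins a given game with probability 1/4.
p_value = binomial_tail(TOP_WINS, GAMES, 1 / PLAYERS)
print(f"P(at least {TOP_WINS} wins in {GAMES} games | equal strength) = {p_value:.2e}")
```

The tail probability is far below conventional significance thresholds, which agrees in direction with the study's conclusion that the championship outcome is hard to explain by chance among equally strong agents.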
