Agent Island: Dynamic Benchmark for Language Model Agents
Agent Island is a multiplayer simulation environment in which language-model agents compete in games of cooperation, conflict, and persuasion. It serves as a dynamic benchmark designed to resist saturation and contamination, two common failure modes of static benchmarks: because agents face adaptive opponents rather than fixed tasks in this winner-take-all setting, a new model can always outperform the current leader. Players are ranked with a Bayesian Plackett-Luce model, which quantifies the uncertainty in each agent's estimated skill. Across 999 games involving 49 unique models, openai/gpt-5.5 leads with a posterior mean skill of 5.64, followed by openai/gpt-5.2 at 3.10 and openai/gpt-5.3-codex at 2.86. The full game logs are released as a dataset.
Key facts
- Agent Island is a multiplayer simulation environment for language-model agents.
- The benchmark is designed to mitigate saturation and contamination.
- New models can always outperform the current leading player.
- Agents compete against other adaptive agents, not fixed tasks.
- Ranking uses a Bayesian Plackett-Luce model (see the sketch after this list).
- 999 games involving 49 unique models were played.
- openai/gpt-5.5 has a posterior mean skill of 5.64.
- openai/gpt-5.2 has a posterior mean skill of 3.10.
- openai/gpt-5.3-codex has a posterior mean skill of 2.86.
- Game logs are released as a dataset.
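The write-up does not include the inference code, so the following is a minimal sketch of how a Bayesian Plackett-Luce ranking can be fit: the Plackett-Luce likelihood scores each game's finishing order given per-player skills, and a random-walk Metropolis sampler yields posterior means and standard deviations, the kind of skill-uncertainty estimates the benchmark reports. The prior, sampler, step size, and all function names below are illustrative assumptions, not Agent Island's actual implementation.

```python
import numpy as np

def pl_log_likelihood(skills, rankings):
    """Plackett-Luce log-likelihood of a set of observed finishing orders.

    Each ranking lists player indices from first to last place; the model
    assigns it probability  prod_k  exp(s[r_k]) / sum_{j >= k} exp(s[r_j]).
    """
    ll = 0.0
    for ranking in rankings:
        s = skills[ranking]
        for k in range(len(ranking) - 1):  # last place contributes log 1 = 0
            ll += s[k] - np.logaddexp.reduce(s[k:])
    return ll

def sample_posterior(rankings, n_players, n_steps=5000, step=0.3,
                     prior_sd=5.0, seed=0):
    """Random-walk Metropolis over skills with a N(0, prior_sd^2) prior."""
    rng = np.random.default_rng(seed)
    skills = np.zeros(n_players)

    def log_post(s):
        return pl_log_likelihood(s, rankings) - 0.5 * np.sum(s**2) / prior_sd**2

    lp = log_post(skills)
    samples = []
    for t in range(n_steps):
        prop = skills + rng.normal(0.0, step, n_players)
        prop -= prop.mean()  # skills are identified only up to an additive constant
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:
            skills, lp = prop, lp_prop
        if t >= n_steps // 2:  # keep the second half as posterior draws
            samples.append(skills.copy())
    return np.array(samples)

# Toy data: player 0 usually finishes ahead of 1, who finishes ahead of 2.
rankings = [np.array([0, 1, 2])] * 8 + [np.array([1, 0, 2])] * 2
draws = sample_posterior(rankings, n_players=3)
for i, (m, sd) in enumerate(zip(draws.mean(axis=0), draws.std(axis=0))):
    print(f"player {i}: posterior mean skill {m:+.2f} ± {sd:.2f}")
```

Centering each proposal handles the usual Plackett-Luce identifiability issue (skills are defined only up to an additive constant), and the spread of the posterior draws provides the per-player uncertainty alongside the mean skill.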