Agent Island: Dynamic Benchmark for Language Model Agents
Agent Island is a multiplayer simulation environment in which language-model agents compete in games of cooperation, conflict, and persuasion. It serves as a dynamic benchmark designed to resist saturation and contamination, two common failure modes of static benchmarks: because agents face adaptive opponents rather than fixed tasks in this winner-take-all setting, a new model can always outperform the current leader. Players are ranked with a Bayesian Plackett-Luce model, which quantifies the uncertainty in each agent's estimated skill. Across 999 games involving 49 unique models, openai/gpt-5.5 leads with a posterior mean skill of 5.64, followed by openai/gpt-5.2 at 3.10 and openai/gpt-5.3-codex at 2.86. The full game logs are released as a dataset.
Key facts
- Agent Island is a multiplayer simulation environment for language-model agents.
- The benchmark is designed to mitigate saturation and contamination.
- New models can always outperform the current leading player.
- Agents compete against other adaptive agents, not fixed tasks.
- Ranking uses a Bayesian Plackett-Luce model (see the sketch after this list).
- 999 games involving 49 unique models were played.
- openai/gpt-5.5 has a posterior mean skill of 5.64.
- openai/gpt-5.2 has a posterior mean skill of 3.10.
- openai/gpt-5.3-codex has a posterior mean skill of 2.86.
- Game logs are released as a dataset.
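The write-up does not include the inference code, so the following is a minimal sketch of how a Bayesian Plackett-Luce ranking can be fit: the Plackett-Luce likelihood scores each game's finishing order given per-player skills, and a random-walk Metropolis sampler yields posterior means and standard deviations, the kind of skill-uncertainty estimates the benchmark reports. The prior, sampler, step size, and all function names below are illustrative assumptions, not Agent Island's actual implementation.

```python
import numpy as np

def pl_log_likelihood(skills, rankings):
    """Plackett-Luce log-likelihood of a set of observed finishing orders.

    Each ranking lists player indices from first to last place; the model
    assigns it probability  prod_k  exp(s[r_k]) / sum_{j >= k} exp(s[r_j]).
    """
    ll = 0.0
    for ranking in rankings:
        s = skills[ranking]
        for k in range(len(ranking) - 1):  # last place contributes log 1 = 0
            ll += s[k] - np.logaddexp.reduce(s[k:])
    return ll

def sample_posterior(rankings, n_players, n_steps=5000, step=0.3,
                     prior_sd=5.0, seed=0):
    """Random-walk Metropolis over skills with a N(0, prior_sd^2) prior."""
    rng = np.random.default_rng(seed)
    skills = np.zeros(n_players)

    def log_post(s):
        return pl_log_likelihood(s, rankings) - 0.5 * np.sum(s**2) / prior_sd**2

    lp = log_post(skills)
    samples = []
    for t in range(n_steps):
        prop = skills + rng.normal(0.0, step, n_players)
        prop -= prop.mean()  # skills are identified only up to an additive constant
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:
            skills, lp = prop, lp_prop
        if t >= n_steps // 2:  # keep the second half as posterior draws
            samples.append(skills.copy())
    return np.array(samples)

# Toy data: player 0 usually finishes ahead of 1, who finishes ahead of 2.
rankings = [np.array([0, 1, 2])] * 8 + [np.array([1, 0, 2])] * 2
draws = sample_posterior(rankings, n_players=3)
for i, (m, sd) in enumerate(zip(draws.mean(axis=0), draws.std(axis=0))):
    print(f"player {i}: posterior mean skill {m:+.2f} ± {sd:.2f}")
```

Centering each proposal handles the usual Plackett-Luce identifiability issue (skills are defined only up to an additive constant), and the spread of the posterior draws provides the per-player uncertainty alongside the mean skill.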