ARTFEED — Contemporary Art Intelligence

Agentick: A Unified Benchmark for Sequential Decision-Making Agents

ai-technology · 2026-05-11

Researchers have launched Agentick, a benchmark for sequential decision-making agents spanning RL, LLM, VLM, human, and hybrid approaches. It comprises 37 procedurally generated tasks organized into six capability categories, four difficulty levels, and five observation modalities, all exposed through a Gymnasium-compatible interface. The suite ships with a Coding API, oracle reference policies, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation of 27 configurations across 90,000 episodes found that no single method dominates; GPT-5 mini achieved the highest overall score, 0.309.

Key facts

  • Agentick is a benchmark for sequential decision-making agents.
  • It evaluates RL, LLM, VLM, hybrid, and human agents.
  • The benchmark includes 37 procedurally generated tasks.
  • Tasks span six capability categories and four difficulty levels.
  • Five observation modalities are supported.
  • The interface is Gymnasium-compatible.
  • A Coding API, oracle reference policies, pre-built SFT datasets, a composable agent harness, and a live leaderboard are provided.
  • An evaluation of 27 configurations over 90,000 episodes showed GPT-5 mini leads with 0.309.
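A Gymnasium-compatible interface means agents interact with every task through the standard reset/step loop, regardless of whether the agent is an RL policy, an LLM, or a human. The sketch below illustrates that loop shape with a hypothetical toy environment; the task, class names, and reward scheme are illustrative assumptions, not part of Agentick's actual 37 tasks or its API.

```python
import random


class ToyGridEnv:
    """Minimal stand-in mimicking the Gymnasium API shape:
    reset() -> (obs, info); step(a) -> (obs, reward, terminated, truncated, info).
    This toy 1-D grid task is hypothetical, not one of Agentick's tasks."""

    def __init__(self, size=5, max_steps=20, seed=0):
        self.size, self.max_steps = size, max_steps
        self.rng = random.Random(seed)

    def reset(self):
        # Agent starts at cell 0; goal is the rightmost cell.
        self.pos, self.goal, self.t = 0, self.size - 1, 0
        return self.pos, {}

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped to the grid)
        self.pos = max(0, min(self.size - 1, self.pos + (1 if action == 1 else -1)))
        self.t += 1
        terminated = self.pos == self.goal        # reached the goal
        truncated = self.t >= self.max_steps      # episode step budget exhausted
        return self.pos, float(terminated), terminated, truncated, {}


def run_episode(env, policy):
    """Roll out one episode: any callable obs -> action works as the agent."""
    obs, _ = env.reset()
    total = 0.0
    while True:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            return total


env = ToyGridEnv()
score = run_episode(env, lambda obs: 1)  # trivial always-move-right policy
```

Because the loop only assumes reset/step signatures, the same `run_episode` harness could drive an RL policy, an LLM wrapper that maps observations to actions, or a human input prompt, which is the point of standardizing on this interface.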

Entities

Institutions

  • arXiv

Sources