Frontier LLMs Match or Beat Classical Planners on IPC Tasks

ai-technology · 2026-05-18

A recent study posted on arXiv challenges earlier conclusions that large language models cannot plan effectively. The researchers evaluated three frontier LLMs (Gemini 3.1 Pro, GPT-5, and a third model not named here) on a demanding set of tasks drawn from the latest International Planning Competition. They verified every solution with a validation tool, generated new tasks to prevent data contamination, and compared the results against leading classical planners. Gemini 3.1 Pro solved 245 of 360 tasks (about 68%), ahead of the strongest planner baseline, which solved 234 (65%). GPT-5 performed comparably to the baselines. Performance declined when semantic information was obfuscated, but Gemini 3.1 Pro remained competitive, which challenges the earlier claim that LLMs struggle with even basic planning tasks.

Key facts

The study evaluates three frontier LLMs on planning tasks from the International Planning Competition.
Gemini 3.1 Pro solved 245 of 360 tasks (about 68%), outperforming the strongest classical planner baseline, which solved 234 (65%).
GPT-5 achieved performance comparable to classical planner baselines.
Tasks were freshly created to avoid data contamination.
Solutions were verified with a validation tool.
When semantic information was obfuscated, Gemini 3.1 Pro remained competitive with the strongest baselines.
The study challenges earlier findings that LLMs cannot reliably solve simple planning tasks.
The research is published on arXiv under identifier 2511.09378.