LLM Planning Agents: How Much Competence Comes from the Harness?
A new arXiv preprint (2604.07236) investigates how much of an AI agent's performance is attributable to its planning harness versus the underlying language model. The researchers externalized a planning harness for the game Collaborative Battleship into four layers: posterior belief tracking, declarative planning, symbolic reflection, and an LLM-backed revision gate. Across 54 games, they measured win rate (primary metric) and F1 score (secondary), defining 'heavy lifting' as the largest positive marginal contribution to win rate. Declarative planning alone yielded a +24.1 percentage-point increase in win rate over a belief-only harness while requiring zero LLM calls. The findings suggest that the harness itself carries significant competence, raising questions about the residual role of the LLM in planning agents.
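The four-layer decomposition can be pictured as a pipeline in which only the final gate touches the LLM. The sketch below is illustrative, not the paper's code: the layer names come from the summary above, but every function body, the 5x5 board size, and the belief-update heuristic are assumptions made here for concreteness.

```python
# Hypothetical sketch of the four-layer harness; names follow the paper's
# layer labels, but all implementations here are illustrative assumptions.
GRID = 5  # assumed board size for illustration

def update_beliefs(beliefs, shot, hit):
    """Layer 1: posterior belief tracking (crude Bayesian-style update)."""
    new = dict(beliefs)
    new[shot] = 0.0  # a fired cell can no longer hide an unhit ship part
    if hit:
        # Ships are contiguous, so raise belief in orthogonal neighbours.
        r, c = shot
        for nb in [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]:
            if nb in new:
                new[nb] *= 2.0
    total = sum(new.values()) or 1.0
    return {cell: p / total for cell, p in new.items()}

def declarative_plan(beliefs):
    """Layer 2: declarative planning -- pick the max-posterior cell.
    Note: this layer needs zero LLM calls."""
    return max(beliefs, key=beliefs.get)

def symbolic_reflect(history, plan):
    """Layer 3: symbolic reflection -- veto a plan that repeats a past shot."""
    return plan not in {shot for shot, _ in history}

def llm_revision_gate(plan, approved):
    """Layer 4: LLM-backed revision gate (stubbed here: in the real harness
    an LLM would review the plan; this stub just passes the verdict through)."""
    return plan if approved else None
```

A single turn would then chain the layers: update beliefs from the last shot, plan declaratively, reflect symbolically, and only then consult the gate.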
Key facts
- arXiv:2604.07236
- Agent harnesses can change end-to-end performance by as much as six times on a fixed model
- Planning harness for Collaborative Battleship externalized into four layers
- Declarative planning provided +24.1 pp win rate over belief-only harness
- Zero LLM calls needed for declarative planning layer
- 54 games were played
- Primary metric: win rate; secondary: F1
- 'Heavy lifting' defined as the largest positive marginal contribution to the primary metric
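The 'heavy lifting' definition in the facts above can be made concrete: build the harness up one layer at a time, take the win-rate delta each layer adds, and report the largest positive delta. In the sketch below, only the +24.1 pp marginal for declarative planning is from the paper; the baseline win rate and the other layers' numbers are hypothetical placeholders.

```python
# Cumulative win rates as layers are added, in the paper's layer order.
# Only declarative planning's +24.1 pp marginal is reported in the paper;
# the absolute values and other marginals below are hypothetical.
cumulative = [
    ("belief tracking",      0.300),  # hypothetical belief-only baseline
    ("declarative planning", 0.541),  # +24.1 pp over belief-only (reported)
    ("symbolic reflection",  0.570),  # hypothetical
    ("LLM revision gate",    0.600),  # hypothetical
]

def heaviest_lifter(cumulative):
    """'Heavy lifting' = largest positive marginal contribution to win rate."""
    marginals = []
    prev = cumulative[0][1]
    for name, win_rate in cumulative[1:]:
        marginals.append((name, win_rate - prev))
        prev = win_rate
    positive = [m for m in marginals if m[1] > 0]
    return max(positive, key=lambda m: m[1])
```

Under these placeholder numbers, declarative planning is the heaviest lifter, matching the paper's headline finding.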
Entities
Institutions
- arXiv