LLM Planning Agents: How Much Competence Comes from the Harness?
A new arXiv preprint (2604.07236) investigates how much of an AI agent's performance is attributable to its planning harness versus the underlying language model. The researchers externalized a planning harness for the game Collaborative Battleship into four layers: posterior belief tracking, declarative planning, symbolic reflection, and an LLM-backed revision gate. Across 54 games, they measured win rate (primary metric) and F1 score (secondary), defining 'heavy lifting' as the largest positive marginal contribution to win rate. Declarative planning alone yielded a +24.1 percentage-point increase in win rate over a belief-only harness while requiring zero LLM calls. The findings suggest that the harness itself carries significant competence, raising questions about the residual role of the LLM in planning agents.
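The four-layer decomposition can be pictured as a pipeline in which only the final gate touches the LLM. The sketch below is illustrative, not the paper's code: the layer names come from the summary above, but every function body, the 5x5 board size, and the belief-update heuristic are assumptions made here for concreteness.

```python
# Hypothetical sketch of the four-layer harness; names follow the paper's
# layer labels, but all implementations here are illustrative assumptions.
GRID = 5  # assumed board size for illustration

def update_beliefs(beliefs, shot, hit):
    """Layer 1: posterior belief tracking (crude Bayesian-style update)."""
    new = dict(beliefs)
    new[shot] = 0.0  # a fired cell can no longer hide an unhit ship part
    if hit:
        # Ships are contiguous, so raise belief in orthogonal neighbours.
        r, c = shot
        for nb in [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]:
            if nb in new:
                new[nb] *= 2.0
    total = sum(new.values()) or 1.0
    return {cell: p / total for cell, p in new.items()}

def declarative_plan(beliefs):
    """Layer 2: declarative planning -- pick the max-posterior cell.
    Note: this layer needs zero LLM calls."""
    return max(beliefs, key=beliefs.get)

def symbolic_reflect(history, plan):
    """Layer 3: symbolic reflection -- veto a plan that repeats a past shot."""
    return plan not in {shot for shot, _ in history}

def llm_revision_gate(plan, approved):
    """Layer 4: LLM-backed revision gate (stubbed here: in the real harness
    an LLM would review the plan; this stub just passes the verdict through)."""
    return plan if approved else None
```

A single turn would then chain the layers: update beliefs from the last shot, plan declaratively, reflect symbolically, and only then consult the gate.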
Key facts
- arXiv:2604.07236
- Agent harnesses can change end-to-end performance by as much as six times on a fixed model
- Planning harness for Collaborative Battleship externalized into four layers
- Declarative planning provided +24.1 pp win rate over belief-only harness
- Zero LLM calls needed for declarative planning layer
- 54 games were played
- Primary metric: win rate; secondary: F1
- 'Heavy lifting' defined as the largest positive marginal contribution to the primary metric
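The 'heavy lifting' definition in the facts above can be made concrete: build the harness up one layer at a time, take the win-rate delta each layer adds, and report the largest positive delta. In the sketch below, only the +24.1 pp marginal for declarative planning is from the paper; the baseline win rate and the other layers' numbers are hypothetical placeholders.

```python
# Cumulative win rates as layers are added, in the paper's layer order.
# Only declarative planning's +24.1 pp marginal is reported in the paper;
# the absolute values and other marginals below are hypothetical.
cumulative = [
    ("belief tracking",      0.300),  # hypothetical belief-only baseline
    ("declarative planning", 0.541),  # +24.1 pp over belief-only (reported)
    ("symbolic reflection",  0.570),  # hypothetical
    ("LLM revision gate",    0.600),  # hypothetical
]

def heaviest_lifter(cumulative):
    """'Heavy lifting' = largest positive marginal contribution to win rate."""
    marginals = []
    prev = cumulative[0][1]
    for name, win_rate in cumulative[1:]:
        marginals.append((name, win_rate - prev))
        prev = win_rate
    positive = [m for m in marginals if m[1] > 0]
    return max(positive, key=lambda m: m[1])
```

Under these placeholder numbers, declarative planning is the heaviest lifter, matching the paper's headline finding.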
Entities
Institutions
- arXiv