LLM Agent Harness Complexity Paradox: Strict Guidance Hurts Frontier Chat Models

ai-technology · 2026-05-27

A new arXiv preprint (2605.26731) challenges two assumptions: that more structured harnesses universally improve LLM agent reliability, and that higher-capability models need less guidance. In a 432-run experiment on the HEAT-24 benchmark, covering six models across four capability tiers and three harness conditions (light, balanced, strict), the researchers found a non-monotone relationship rather than the expected monotone inverse one. For Gemini 2.5 Flash, increasing harness verbosity lowered VTSR by 29-38 percentage points, revealing a harness-complexity paradox. For Qwen3.5-122B with extended thinking, by contrast, the strict harness achieved the highest VTSR, at 91.7%.

Key facts

arXiv paper 2605.26731
432-run experiment
six models across four capability tiers
three harness conditions: light, balanced, strict
HEAT-24 benchmark with 24 tasks
Gemini 2.5 Flash VTSR drop of 29-38 percentage points with increased harness verbosity
Qwen3.5-122B (extended thinking) strict-harness VTSR of 91.7%
monotone inverse relationship between model capability and need for guidance refuted
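
The counts in the key facts are mutually consistent under one simple reading, sketched below. This is only an illustration, not the paper's protocol: it assumes a full grid with one run per (model, harness, task) combination, and it assumes VTSR for a single model/harness cell is a pass fraction over the 24 tasks. The model and task names are placeholders, and the figure of 22 passes is a hypothetical value chosen because it rounds to the reported 91.7%.

```python
from itertools import product

# Counts come from the key facts; the identifiers are placeholders,
# not the actual model or task names used in the paper.
models = [f"model_{i}" for i in range(6)]
harnesses = ["light", "balanced", "strict"]
tasks = [f"task_{i:02d}" for i in range(24)]

# Assumption: exactly one run per (model, harness, task) cell.
runs = list(product(models, harnesses, tasks))
total_runs = len(runs)  # 6 * 3 * 24

# Assumption: VTSR for one model/harness cell is passes / tasks.
# 22 passes is a hypothetical count, not a figure from the paper.
passes = 22
vtsr_pct = round(100 * passes / len(tasks), 1)

print(total_runs, vtsr_pct)
```

Under these assumptions the grid yields 432 runs, matching the reported experiment size, and 22 of 24 passing tasks rounds to 91.7%. Other designs (for example, repeated runs over a subset of cells) could produce the same totals, so this is a plausibility check rather than a reconstruction.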
