LLM Agent Harness Complexity Paradox: Strict Guidance Hurts Frontier Chat Models
A new study posted to arXiv (2605.26731) challenges two common assumptions: that more structured harnesses universally improve LLM agent reliability, and that higher-capability models need less guidance. In a 432-run experiment on the HEAT-24 benchmark, covering six models across four capability tiers, the researchers found that the relationship between model capability and the benefit of guidance is non-monotone. For Gemini 2.5 Flash, increased harness verbosity lowered VTSR by 29-38 percentage points, revealing a harness-complexity paradox. For Qwen3.5-122B with extended thinking, by contrast, the strict harness achieved the highest VTSR, 91.7%.
Key facts
- arXiv paper 2605.26731
- 432-run experiment
- six models across four capability tiers
- three harness conditions: light, balanced, strict
- HEAT-24 benchmark with 24 tasks
- Gemini 2.5 Flash VTSR drop of 29-38 percentage points with increased harness verbosity
- Qwen3.5-122B (extended thinking) strict-harness VTSR of 91.7%
- assumed monotone inverse relationship between model capability and required guidance refuted
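The figures listed above are mutually consistent under two simple assumptions, which the summary itself does not confirm: that the experiment is a full factorial design with one run per (model, harness condition, task) cell, and that VTSR is reported as the share of the 24 tasks that succeed. A minimal Python sketch of that arithmetic, with hypothetical variable and function names:

```python
# Counts quoted in the key facts.
MODELS = 6
HARNESS_CONDITIONS = 3  # light, balanced, strict
TASKS = 24              # HEAT-24

# Assumption: one run per (model, harness, task) cell.
total_runs = MODELS * HARNESS_CONDITIONS * TASKS
print(total_runs)  # 432


def success_rate_pct(successes: int, tasks: int = TASKS) -> float:
    """Share of tasks that succeed, as a percentage rounded to one decimal.

    Assumption: VTSR is a per-task success fraction; the summary does not
    define the metric.
    """
    return round(100 * successes / tasks, 1)


print(success_rate_pct(22))  # 91.7
```

Under these assumptions, 6 x 3 x 24 gives the 432 runs, and 22 successes out of 24 tasks gives the 91.7% quoted for Qwen3.5-122B. With only 24 tasks per cell, each task accounts for roughly 4.2 percentage points, so small VTSR differences should be read with that granularity in mind.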
Entities
Institutions
- arXiv