ARTFEED — Contemporary Art Intelligence

LLM Agent Harness Complexity Paradox: Strict Guidance Hurts Frontier Chat Models

ai-technology · 2026-05-27

A new arXiv preprint (2605.26731) challenges two assumptions: that more structured harnesses universally improve LLM agent reliability, and that higher-capability models need less guidance. In a 432-run experiment on the HEAT-24 benchmark, covering six models across four capability tiers, the researchers found that the relationship between model capability and useful guidance is non-monotone. For Gemini 2.5 Flash, increasing harness verbosity lowered VTSR by 29-38 percentage points, revealing a harness-complexity paradox. For Qwen3.5-122B with extended thinking, by contrast, the strict harness achieved the highest VTSR, at 91.7%.

Key facts

  • arXiv paper 2605.26731
  • 432-run experiment
  • six models across four capability tiers
  • three harness conditions: light, balanced, strict
  • HEAT-24 benchmark with 24 tasks
  • Gemini 2.5 Flash VTSR drop of 29-38 percentage points with increased harness verbosity
  • Qwen3.5-122B strict harness VTSR 91.7%
  • assumed monotone inverse relationship between model capability and needed guidance refuted
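
The figures above fit together arithmetically. The following is a minimal sketch, assuming (the summary does not state this) that the 432 runs form a full factorial design of one run per model, harness condition, and task, and that a rate of 91.7% on a 24-task benchmark corresponds to 22 successes out of 24. The function name success_rate_pct is hypothetical and is not taken from the paper.

```python
# Assumption: one run per (model, harness condition, task) combination.
NUM_MODELS = 6
HARNESS_CONDITIONS = ["light", "balanced", "strict"]
NUM_TASKS = 24  # HEAT-24 benchmark

total_runs = NUM_MODELS * len(HARNESS_CONDITIONS) * NUM_TASKS


def success_rate_pct(successes: int, total: int) -> float:
    """Return a success rate as a percentage, rounded to one decimal place."""
    return round(100 * successes / total, 1)


# Assumption: 22 of 24 tasks succeeded; this count is inferred, not reported.
strict_rate = success_rate_pct(22, NUM_TASKS)

# Each task is worth this many percentage points on a 24-task benchmark.
points_per_task = success_rate_pct(1, NUM_TASKS)

print(total_runs, strict_rate, points_per_task)
```

Under these assumptions the product matches the reported 432 runs and the ratio matches the reported 91.7%. A drop of 29-38 percentage points would then correspond to roughly seven to nine tasks out of 24, although the paper may aggregate its results differently.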

Entities

Institutions

  • arXiv

Sources