LLM Agent Scaffolding: More Components Can Hurt Performance
A new study on arXiv (2605.05716) reveals that adding more scaffolding components to LLM agent systems can degrade performance through cross-component interference (CCI). The researchers ran a full factorial experiment over all 2^5 = 32 subsets of five components (planning, tools, memory, self-reflection, retrieval) on HotpotQA and GSM8K using Llama-3.1-8B/70B, for a total of 96 conditions with up to 10 seeds.

The All-In system, which uses all five components, was consistently suboptimal. On HotpotQA, a single-tool agent outperformed All-In by 32% (F1 0.233 vs 0.177, p=0.023); on GSM8K, a 3-component subset beat All-In by 79% (0.43 vs 0.24, p=0.010). The optimal number of components is task-dependent (k*=1-4) and scale-sensitive: at 70B, some combinations that hurt at 8B yielded gains, yet All-In still trailed the best subset.

A main-effects regression achieved R²=0.916 (adj-R²=0.899, LOOCV=0.872). Exact Shapley values revealed 183/325 submodularity violations (56.3%), indicating that greedy component selection is unreliable.
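The 2^5 = 32 design can be enumerated directly. A minimal sketch in Python (component names taken from the summary; how each subset is actually evaluated is not specified here):

```python
from itertools import combinations

# The five scaffolding components studied in the paper.
COMPONENTS = ["planning", "tools", "memory", "self-reflection", "retrieval"]

def all_subsets(items):
    """Yield every subset of `items`, from the empty set to the full set."""
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

subsets = list(all_subsets(COMPONENTS))
print(len(subsets))  # 2^5 = 32 subsets per (task, model) pair
```

Each subset defines one agent configuration; scoring every subset on every task/model pair, with repeated seeds, yields the full factorial grid the study analyzes.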
Key facts
- Cross-component interference (CCI) degrades LLM agent performance when components interact destructively.
- Full factorial experiment over 2^5=32 subsets of five components on HotpotQA and GSM8K.
- Used Llama-3.1-8B/70B with 96 conditions and up to 10 seeds.
- All-In system (all five components) was consistently suboptimal.
- On HotpotQA, single-tool agent surpassed All-In by 32% (F1 0.233 vs 0.177, p=0.023).
- On GSM8K, a 3-component subset beat All-In by 79% (0.43 vs 0.24, p=0.010).
- Optimal component count is task-dependent (k*=1-4) and scale-sensitive.
- At 70B, some combinations that hurt at 8B provided gains, but All-In still trailed best subset.
- Main-effects regression: R²=0.916, adj-R²=0.899, LOOCV=0.872.
- Exact Shapley values: 183/325 submodularity violations (56.3%), showing that greedy selection is unreliable.
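A main-effects model fits one additive coefficient per component on the 0/1 indicator vector of each subset; a high R² means most of the score variation is explained without interaction terms. A minimal sketch with NumPy least squares (the scores below are synthetic placeholders, not the paper's data):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Design matrix: one row per subset, one 0/1 column per component, plus intercept.
n_components = 5
X = np.array(list(product([0, 1], repeat=n_components)), dtype=float)  # 32 x 5
X_design = np.hstack([np.ones((len(X), 1)), X])                        # 32 x 6

# Synthetic scores: additive main effects plus noise (placeholder values only).
true_effects = np.array([0.05, -0.02, 0.03, 0.01, -0.04])
y = 0.2 + X @ true_effects + rng.normal(0, 0.01, size=len(X))

# Fit the main-effects model and compute R^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 = {r2:.3f}")  # near 1 when effects are additive; interference lowers it
```

On real scores with cross-component interference, the residual gap (1 - R²) reflects interaction effects the additive model cannot capture.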
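With only five components, exact Shapley values need just the 32 subset scores, and submodularity (diminishing returns) can be checked by comparing each component's marginal gain on nested subsets. A sketch with a hypothetical synthetic score function (not the paper's data; its 325 is the total number of nested-subset checks, and the real scores violated 183 of them):

```python
from itertools import combinations
from math import factorial

COMPONENTS = ["planning", "tools", "memory", "self-reflection", "retrieval"]

def score(subset):
    """Hypothetical score: additive gains minus a pairwise interference penalty."""
    base = 0.1 + 0.05 * len(subset)
    penalty = 0.04 * len(subset) * (len(subset) - 1) / 2
    return base - penalty

def shapley(component):
    """Exact Shapley value: weighted average marginal contribution over subsets."""
    n = len(COMPONENTS)
    others = [c for c in COMPONENTS if c != component]
    total = 0.0
    for k in range(n):
        for combo in combinations(others, k):
            s = frozenset(combo)
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (score(s | {component}) - score(s))
    return total

def submodularity_violations():
    """Count (component, S ⊂ T) pairs where the marginal gain on the larger
    set T exceeds the gain on the smaller set S (diminishing-returns failure)."""
    violations = checks = 0
    for c in COMPONENTS:
        others = [x for x in COMPONENTS if x != c]
        for k_s in range(len(others) + 1):
            for s_combo in combinations(others, k_s):
                S = frozenset(s_combo)
                for k_t in range(k_s + 1, len(others) + 1):
                    for t_combo in combinations(others, k_t):
                        T = frozenset(t_combo)
                        if S < T:
                            checks += 1
                            gain_S = score(S | {c}) - score(S)
                            gain_T = score(T | {c}) - score(T)
                            if gain_T > gain_S + 1e-12:
                                violations += 1
    return violations, checks

v, n_checks = submodularity_violations()
print(v, n_checks)  # this synthetic score is submodular, so 0 of 325 checks fail
```

The 325 checks arise from the same combinatorics as in the paper (5 components × 65 nested-subset pairs each). When more than half of them fail, as reported, adding the locally best component at each step gives no guarantee of finding a good subset.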