LLM Agent Scaffolding: More Components Can Hurt Performance
A new study on arXiv (2605.05716) reveals that adding more scaffolding components to LLM agent systems can degrade performance through cross-component interference (CCI). The researchers ran a full factorial experiment over all 2^5 = 32 subsets of five components (planning, tools, memory, self-reflection, retrieval) on HotpotQA and GSM8K using Llama-3.1-8B/70B, for a total of 96 conditions with up to 10 seeds.

The All-In system, which uses all five components, was consistently suboptimal. On HotpotQA, a single-tool agent outperformed All-In by 32% (F1 0.233 vs 0.177, p=0.023); on GSM8K, a 3-component subset beat All-In by 79% (0.43 vs 0.24, p=0.010). The optimal number of components is task-dependent (k*=1-4) and scale-sensitive: at 70B, some combinations that hurt at 8B yielded gains, yet All-In still trailed the best subset.

A main-effects regression achieved R²=0.916 (adj-R²=0.899, LOOCV=0.872). Exact Shapley values revealed 183/325 submodularity violations (56.3%), indicating that greedy component selection is unreliable.
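The 2^5 = 32 design can be enumerated directly. A minimal sketch in Python (component names taken from the summary; how each subset is actually evaluated is not specified here):

```python
from itertools import combinations

# The five scaffolding components studied in the paper.
COMPONENTS = ["planning", "tools", "memory", "self-reflection", "retrieval"]

def all_subsets(items):
    """Yield every subset of `items`, from the empty set to the full set."""
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

subsets = list(all_subsets(COMPONENTS))
print(len(subsets))  # 2^5 = 32 subsets per (task, model) pair
```

Each subset defines one agent configuration; scoring every subset on every task/model pair, with repeated seeds, yields the full factorial grid the study analyzes.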
Key facts
- Cross-component interference (CCI) degrades LLM agent performance when components interact destructively.
- Full factorial experiment over 2^5=32 subsets of five components on HotpotQA and GSM8K.
- Used Llama-3.1-8B/70B with 96 conditions and up to 10 seeds.
- All-In system (all five components) was consistently suboptimal.
- On HotpotQA, single-tool agent surpassed All-In by 32% (F1 0.233 vs 0.177, p=0.023).
- On GSM8K, a 3-component subset beat All-In by 79% (0.43 vs 0.24, p=0.010).
- Optimal component count is task-dependent (k*=1-4) and scale-sensitive.
- At 70B, some combinations that hurt at 8B provided gains, but All-In still trailed best subset.
- Main-effects regression: R²=0.916, adj-R²=0.899, LOOCV=0.872.
- Exact Shapley values: 183/325 submodularity violations (56.3%), showing that greedy selection is unreliable.
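A main-effects model fits one additive coefficient per component on the 0/1 indicator vector of each subset; a high R² means most of the score variation is explained without interaction terms. A minimal sketch with NumPy least squares (the scores below are synthetic placeholders, not the paper's data):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Design matrix: one row per subset, one 0/1 column per component, plus intercept.
n_components = 5
X = np.array(list(product([0, 1], repeat=n_components)), dtype=float)  # 32 x 5
X_design = np.hstack([np.ones((len(X), 1)), X])                        # 32 x 6

# Synthetic scores: additive main effects plus noise (placeholder values only).
true_effects = np.array([0.05, -0.02, 0.03, 0.01, -0.04])
y = 0.2 + X @ true_effects + rng.normal(0, 0.01, size=len(X))

# Fit the main-effects model and compute R^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 = {r2:.3f}")  # near 1 when effects are additive; interference lowers it
```

On real scores with cross-component interference, the residual gap (1 - R²) reflects interaction effects the additive model cannot capture.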
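With only five components, exact Shapley values need just the 32 subset scores, and submodularity (diminishing returns) can be checked by comparing each component's marginal gain on nested subsets. A sketch with a hypothetical synthetic score function (not the paper's data; its 325 is the total number of nested-subset checks, and the real scores violated 183 of them):

```python
from itertools import combinations
from math import factorial

COMPONENTS = ["planning", "tools", "memory", "self-reflection", "retrieval"]

def score(subset):
    """Hypothetical score: additive gains minus a pairwise interference penalty."""
    base = 0.1 + 0.05 * len(subset)
    penalty = 0.04 * len(subset) * (len(subset) - 1) / 2
    return base - penalty

def shapley(component):
    """Exact Shapley value: weighted average marginal contribution over subsets."""
    n = len(COMPONENTS)
    others = [c for c in COMPONENTS if c != component]
    total = 0.0
    for k in range(n):
        for combo in combinations(others, k):
            s = frozenset(combo)
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (score(s | {component}) - score(s))
    return total

def submodularity_violations():
    """Count (component, S ⊂ T) pairs where the marginal gain on the larger
    set T exceeds the gain on the smaller set S (diminishing-returns failure)."""
    violations = checks = 0
    for c in COMPONENTS:
        others = [x for x in COMPONENTS if x != c]
        for k_s in range(len(others) + 1):
            for s_combo in combinations(others, k_s):
                S = frozenset(s_combo)
                for k_t in range(k_s + 1, len(others) + 1):
                    for t_combo in combinations(others, k_t):
                        T = frozenset(t_combo)
                        if S < T:
                            checks += 1
                            gain_S = score(S | {c}) - score(S)
                            gain_T = score(T | {c}) - score(T)
                            if gain_T > gain_S + 1e-12:
                                violations += 1
    return violations, checks

v, n_checks = submodularity_violations()
print(v, n_checks)  # this synthetic score is submodular, so 0 of 325 checks fail
```

The 325 checks arise from the same combinatorics as in the paper (5 components × 65 nested-subset pairs each). When more than half of them fail, as reported, adding the locally best component at each step gives no guarantee of finding a good subset.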