Parallel Compaction Boosts Long-Horizon LLM Agent Efficiency

ai-technology · 2026-05-25

A new arXiv paper introduces parallel context compaction for long-horizon LLM agent serving. It targets a bottleneck in sequential summarization, which stalls inference for tens of seconds while the conversation history is compressed. The method compresses growing histories without blocking agent execution and offers fine-grained control over summary volume, a capability that prompt-based approaches lack. The authors evaluate it across four backbones (8B to 120B parameters, dense and MoE architectures, reasoning and non-reasoning models) on the HotpotQA and LoCoMo benchmarks, where parallel compaction outperforms the sequential synchronous baseline in consistency and speed. The work also highlights that current summarization methods produce unpredictable output token counts and retain different information from run to run, which makes what an agent knows hard to predict.
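To make the non-blocking idea concrete, here is a minimal Python sketch, not the paper's implementation. It assumes compaction can run as a background task while the agent keeps taking steps, with the finished summary spliced in over the turns it covered. The names `summarize`, `run_agent`, `compact_every`, and `max_words` are hypothetical, and the "summary" is a toy truncation standing in for a real model call. The hard word cap only illustrates what an enforced volume limit could look like; the paper's actual control mechanism is not described here.

```python
import asyncio


async def summarize(turns, max_words):
    """Hypothetical stand-in for a slow LLM summarization call."""
    await asyncio.sleep(0.05)  # simulated summarization latency
    words = " ".join(turns).split()
    # Toy "summary": truncate to a hard cap so volume is enforced, not requested.
    return " ".join(words[:max_words])


async def run_agent(steps, compact_every=4, max_words=8):
    history = []    # live context the agent sees
    pending = None  # (task, number of turns covered) for in-flight compaction

    for i in range(steps):
        # The agent step runs whether or not a compaction is in flight.
        await asyncio.sleep(0.01)  # simulated inference step
        history.append(f"turn{i} observation text")

        # If the background summary is ready, splice it in over the
        # turns it covered and keep everything appended since.
        if pending is not None and pending[0].done():
            task, covered = pending
            history = [task.result()] + history[covered:]
            pending = None

        # Start a new compaction in the background once the context is long.
        if pending is None and len(history) >= compact_every:
            snapshot = list(history)
            task = asyncio.create_task(summarize(snapshot, max_words))
            pending = (task, len(snapshot))

    # Drain any compaction still running when the episode ends.
    if pending is not None:
        task, covered = pending
        history = [await task] + history[covered:]
    return history


if __name__ == "__main__":
    print(asyncio.run(run_agent(10)))
```

A sequential baseline would instead `await summarize(...)` inside the loop, pausing every agent step for the full summarization latency. In this sketch the only wait is the final drain after the last step.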

Key facts

arXiv paper 2605.23296 introduces parallel context compaction for long-horizon LLM agents.
Sequential summarization blocks agent inference for tens of seconds.
Prompt instructions for summary volume are largely ignored by current models.
Output token count and retained information fluctuate substantially across runs.
Method evaluated on four backbones from 8B to 120B parameters.
Backbones include dense and MoE architectures, reasoning and non-reasoning models.
Benchmarks used: HotpotQA (multi-hop QA) and LoCoMo (long-term conversational memory).
Parallel compaction offers fine-grained control over summary volume.
