Scale-Conditioned Evaluation of Agent Memory Reveals Reliability Loss

publication · 2026-05-11

A study posted to arXiv (2605.07313) presents a scale-conditioned evaluation protocol for memory agents. The protocol tests whether stored evidence remains usable as the number of unrelated sessions grows: the task evidence is held constant while irrelevant sessions are added around it. It reports four diagnostics: budget-compliant reliability, tail memory-call burden, failure-regime decomposition, and usable-scale boundary. Applied to the LongMemEval and LoCoMo benchmarks across flat, planar, and hierarchical memory interfaces, the protocol shows that reliability loss is not a single phenomenon but takes several forms. For instance, HippoRAG stays within a two-call budget yet loses 16–20 percentage points of budget-compliant reliability as irrelevant sessions accumulate.
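The core idea of holding task evidence fixed while adding irrelevant sessions can be illustrated with a minimal Python sketch. This is not the paper's code: the `Session` type, the `build_scaled_history` helper, the toy session texts, and the chosen scale levels are all hypothetical and serve only to show the shape of the setup.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical record for one conversation session."""
    session_id: str
    text: str
    is_evidence: bool


def build_scaled_history(evidence, distractor_pool, n_distractors, seed=0):
    """Return the fixed evidence sessions plus n_distractors irrelevant ones, shuffled."""
    if n_distractors > len(distractor_pool):
        raise ValueError("not enough distractor sessions in the pool")
    rng = random.Random(seed)  # seeded so each scale level is reproducible
    history = list(evidence) + rng.sample(distractor_pool, n_distractors)
    rng.shuffle(history)
    return history


# Toy data (assumed for illustration only).
evidence = [Session("e0", "The user's dog is named Biscuit.", True)]
pool = [Session(f"d{i}", f"Unrelated chat number {i}.", False) for i in range(100)]

# Same evidence at every scale; only the amount of irrelevant material changes.
histories = {n: build_scaled_history(evidence, pool, n) for n in (0, 10, 50, 100)}
```

Because the evidence set is identical at every scale, any change in an agent's accuracy across `histories` can be attributed to the added sessions rather than to a change in the task itself.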

Key facts

Paper arXiv:2605.07313 introduces scale-conditioned evaluation for agent memory.
Protocol holds task evidence fixed while adding irrelevant sessions.
Four diagnostics: budget-compliant reliability, tail memory-call burden, failure-regime decomposition, usable-scale boundary.
Applied to LongMemEval and LoCoMo benchmarks.
HippoRAG stays within a two-call budget but loses 16–20 percentage points in budget-compliant reliability as irrelevant sessions accumulate.
Memory interfaces tested: flat, planar, hierarchical.
Reliability loss is not a single phenomenon.
Study published on arXiv.
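
The summary names the four diagnostics but does not give their formulas, so the following Python sketch shows only one plausible reading of each. Every definition here is an assumption: reliability as the share of questions answered correctly within the call budget, tail burden as a nearest-rank 95th percentile of memory calls, failure regimes as a split by correctness and budget compliance, and the usable-scale boundary as the largest scale before reliability first drops below a threshold. The `Trial` type, function names, and toy numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """Hypothetical outcome of one question at one scale level."""
    n_distractors: int
    correct: bool
    memory_calls: int


def budget_compliant_reliability(trials, budget):
    # Assumed definition: correct AND within the memory-call budget.
    ok = sum(1 for t in trials if t.correct and t.memory_calls <= budget)
    return ok / len(trials)


def tail_call_burden(trials, q=0.95):
    # Assumed definition: nearest-rank q-th percentile of memory calls.
    calls = sorted(t.memory_calls for t in trials)
    rank = max(1, math.ceil(q * len(calls)))
    return calls[rank - 1]


def failure_regimes(trials, budget):
    # Assumed decomposition: split trials by correctness and budget compliance.
    counts = {"ok": 0, "wrong_within_budget": 0,
              "correct_over_budget": 0, "wrong_over_budget": 0}
    for t in trials:
        within = t.memory_calls <= budget
        if t.correct and within:
            counts["ok"] += 1
        elif t.correct:
            counts["correct_over_budget"] += 1
        elif within:
            counts["wrong_within_budget"] += 1
        else:
            counts["wrong_over_budget"] += 1
    return counts


def usable_scale_boundary(by_scale, budget, threshold):
    # Assumed definition: largest scale reached before reliability first
    # falls below the threshold; None if even the smallest scale fails.
    boundary = None
    for n in sorted(by_scale):
        if budget_compliant_reliability(by_scale[n], budget) < threshold:
            break
        boundary = n
    return boundary


# Toy outcomes (invented for illustration, not results from the paper).
by_scale = {
    0: [Trial(0, True, 1), Trial(0, True, 2), Trial(0, True, 2), Trial(0, False, 2)],
    50: [Trial(50, True, 2), Trial(50, True, 3), Trial(50, False, 2), Trial(50, False, 4)],
}

reliability_0 = budget_compliant_reliability(by_scale[0], budget=2)    # 0.75
reliability_50 = budget_compliant_reliability(by_scale[50], budget=2)  # 0.25
boundary = usable_scale_boundary(by_scale, budget=2, threshold=0.5)    # 0
```

Separating the regimes makes the "not a single phenomenon" point concrete: in the toy data at scale 50, one failure is a wrong answer within budget, one is a correct answer that exceeded the budget, and one is both wrong and over budget.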
