EngramaBench: New Benchmark for Long-Term Conversational Memory
Researchers have introduced EngramaBench, a benchmark for evaluating how well large language model assistants retain information across conversations over time. It comprises five distinct personas, 100 multi-session dialogues, and 150 queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. The evaluation compared Engrama, a graph-structured memory system, against GPT-4o with full-context prompting and against Mem0, an open-source vector-retrieval memory system; all systems used GPT-4o as the answering model. GPT-4o with full context achieved the highest composite score (0.6186), while Engrama scored 0.5367 overall and outperformed full context on cross-space reasoning (0.6532 vs. 0.6291). Mem0, though cheaper to run, trailed at 0.4809.
Key facts
- EngramaBench includes five personas, 100 multi-session conversations, and 150 queries.
- Queries span factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis.
- Engrama is a graph-structured memory system.
- Mem0 is an open-source vector-retrieval memory system.
- All systems use GPT-4o as the answering model.
- GPT-4o full-context achieved highest composite score (0.6186).
- Engrama scored 0.5367 globally.
- Engrama outperformed full-context on cross-space reasoning (0.6532 vs. 0.6291).
- Mem0 scored 0.4809.
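To make the distinction between the memory approaches concrete, the sketch below illustrates the general idea behind vector-retrieval memory of the kind Mem0 represents: conversation snippets are embedded, stored, and fetched by similarity to a query. This is a minimal toy, not Mem0's actual implementation; the class and function names are hypothetical, and the bag-of-words embedding stands in for the learned dense embeddings and approximate-nearest-neighbor indexes real systems use.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; real systems use learned dense vectors.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorMemory:
    """Stores conversation snippets; retrieves the top-k most similar to a query."""
    def __init__(self):
        self.items = []  # list of (text, embedding) pairs

    def add(self, text):
        self.items.append((text, embed(text)))

    def retrieve(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

mem = VectorMemory()
mem.add("User mentioned they adopted a dog named Rex in March.")
mem.add("User prefers vegetarian restaurants.")
mem.add("User works as a data analyst in Lisbon.")
print(mem.retrieve("When did the user adopt their dog?", k=1))
```

A graph-structured system like Engrama instead links memories by entities and relations, which is one plausible reason it scored higher on cross-space integration questions that require connecting facts stored in separate sessions.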
Entities
Institutions
- arXiv