EngramaBench: New Benchmark for Long-Term Conversational Memory
Researchers have introduced EngramaBench, a benchmark for evaluating how well large language model assistants retain information across conversations over time. It comprises five distinct personas, 100 multi-session dialogues, and 150 queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. The evaluation compared Engrama, a graph-structured memory system, against GPT-4o with full-context prompting and against Mem0, an open-source vector-retrieval memory system; all systems used GPT-4o as the answering model. GPT-4o with full context achieved the highest composite score (0.6186), while Engrama scored 0.5367 overall and outperformed full context on cross-space reasoning (0.6532 vs. 0.6291). Mem0, though cheaper to run, trailed at 0.4809.
Key facts
- EngramaBench includes five personas, 100 multi-session conversations, and 150 queries.
- Queries span factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis.
- Engrama is a graph-structured memory system.
- Mem0 is an open-source vector-retrieval memory system.
- All systems use GPT-4o as the answering model.
- GPT-4o full-context achieved highest composite score (0.6186).
- Engrama scored 0.5367 globally.
- Engrama outperformed full-context on cross-space reasoning (0.6532 vs. 0.6291).
- Mem0 scored 0.4809.
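To make the distinction between the memory approaches concrete, the sketch below illustrates the general idea behind vector-retrieval memory of the kind Mem0 represents: conversation snippets are embedded, stored, and fetched by similarity to a query. This is a minimal toy, not Mem0's actual implementation; the class and function names are hypothetical, and the bag-of-words embedding stands in for the learned dense embeddings and approximate-nearest-neighbor indexes real systems use.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; real systems use learned dense vectors.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorMemory:
    """Stores conversation snippets; retrieves the top-k most similar to a query."""
    def __init__(self):
        self.items = []  # list of (text, embedding) pairs

    def add(self, text):
        self.items.append((text, embed(text)))

    def retrieve(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

mem = VectorMemory()
mem.add("User mentioned they adopted a dog named Rex in March.")
mem.add("User prefers vegetarian restaurants.")
mem.add("User works as a data analyst in Lisbon.")
print(mem.retrieve("When did the user adopt their dog?", k=1))
```

A graph-structured system like Engrama instead links memories by entities and relations, which is one plausible reason it scored higher on cross-space integration questions that require connecting facts stored in separate sessions.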
Entities
Institutions
- arXiv