MemReward: Graph-Based Memory for LLM Reward with Scarce Labels

other · 2026-05-25

MemReward is a graph-based experience memory framework that improves reward prediction for large language models (LLMs) in reinforcement learning when ground-truth labels are limited. It stores rollouts (the model's thinking processes) in a graph and propagates reward signals from labeled rollouts to unlabeled ones, an approach inspired by semi-supervised learning. The target setting is data-scarce tasks such as evaluating mathematical proofs or open-ended question answering, where human annotation or expert verification is expensive. MemReward integrates directly into online policy optimization, making reinforcement learning fine-tuning more effective when labels are scarce. The paper is available on arXiv under ID 2603.19310.
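
The idea of propagating rewards from labeled to unlabeled rollouts can be illustrated with classic label propagation on a similarity graph. This is a minimal sketch under stated assumptions, not the paper's implementation: the summary does not say how the graph is built or how rewards are predicted, so the rollout embeddings, the cosine-similarity edge weights, and the function name `propagate_rewards` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def propagate_rewards(embeddings, labels, iters=50):
    """Estimate a reward for every rollout from a few labeled ones.

    embeddings: one vector per rollout (assumed to come from some encoder).
    labels: dict mapping rollout index -> known ground-truth reward.
    Returns a list with one reward estimate per rollout.
    """
    n = len(embeddings)
    # Edge weights: non-negative cosine similarity, no self-loops (assumption).
    weights = [
        [max(0.0, cosine(embeddings[i], embeddings[j])) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    # Unlabeled nodes start at the mean of the known rewards.
    mean_reward = sum(labels.values()) / len(labels)
    rewards = [labels.get(i, mean_reward) for i in range(n)]

    for _ in range(iters):
        updated = []
        for i in range(n):
            if i in labels:
                # Labeled rollouts stay clamped to their ground-truth reward.
                updated.append(labels[i])
                continue
            total = sum(weights[i])
            if total == 0.0:
                # Isolated node: keep the previous estimate.
                updated.append(rewards[i])
            else:
                # Weighted average of the neighbors' current estimates.
                updated.append(
                    sum(weights[i][j] * rewards[j] for j in range(n)) / total
                )
        rewards = updated
    return rewards


# Four rollouts; only rollouts 0 and 2 have ground-truth rewards.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = {0: 1.0, 2: 0.0}
rewards = propagate_rewards(embeddings, labels)
```

In this toy example, rollout 1 sits close to the labeled rollout 0 and so ends up with a high estimated reward, while rollout 3 sits close to rollout 2 and ends up with a low one. A learned predictor over the graph could replace the fixed averaging rule; the sketch only shows how a few labels can inform many unlabeled rollouts.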

Key facts

MemReward is a graph-based experience memory framework for LLM reward prediction.
It addresses reinforcement learning with limited ground-truth labels.
The method propagates rewards from labeled to unlabeled rollouts.
It is inspired by semi-supervised learning techniques.
Target applications include mathematical proof evaluation and open-ended QA.
MemReward integrates into online policy optimization.
The paper is available on arXiv under ID 2603.19310.
It aims to reduce reliance on expensive human annotation.
