MemReward: Graph-Based Memory for LLM Reward with Scarce Labels
MemReward is a graph-based experience memory framework that improves reward prediction for large language models (LLMs) in reinforcement learning when ground-truth labels are limited. The system stores rollouts (the model's thinking processes) in its memory and, in a manner inspired by semi-supervised learning, propagates reward signals from labeled samples to unlabeled ones. This targets data-scarce settings such as evaluating mathematical proofs or open-ended question answering, where human annotation or expert verification is expensive. MemReward integrates directly into online policy optimization, making reinforcement learning fine-tuning more effective when labels are scarce. The paper is available on arXiv under ID 2603.19310.
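To make the idea of propagating rewards from labeled to unlabeled rollouts concrete, the following is a minimal sketch of classic label propagation over a similarity graph. It is not taken from the paper and should not be read as MemReward's actual model. The function names (`similarity`, `propagate_rewards`), the Gaussian-kernel edge weights, the `bandwidth` parameter, and the assumption that each rollout is represented by a fixed-length embedding are all hypothetical choices made for illustration.

```python
import math


def similarity(a, b, bandwidth=1.0):
    """Gaussian kernel on the squared Euclidean distance of two embeddings.

    Hypothetical edge weight; the paper may define graph edges differently.
    """
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def propagate_rewards(embeddings, labeled_rewards, iterations=50, bandwidth=1.0):
    """Estimate a reward for every rollout from a few labeled ones.

    embeddings      : list of equal-length vectors, one per rollout (assumed)
    labeled_rewards : dict mapping rollout index -> known reward
    Returns a list of reward estimates; labeled entries stay fixed.
    """
    if not labeled_rewards:
        raise ValueError("at least one labeled rollout is required")

    n = len(embeddings)
    # Dense similarity graph without self-loops.
    weights = [
        [0.0 if i == j else similarity(embeddings[i], embeddings[j], bandwidth)
         for j in range(n)]
        for i in range(n)
    ]

    # Unlabeled nodes start at the mean of the known rewards.
    prior = sum(labeled_rewards.values()) / len(labeled_rewards)
    rewards = [labeled_rewards.get(i, prior) for i in range(n)]

    for _ in range(iterations):
        updated = list(rewards)
        for i in range(n):
            if i in labeled_rewards:
                continue  # labeled rewards are clamped
            total = sum(weights[i])
            if total > 0:
                # Weighted average of the neighbors' current estimates.
                updated[i] = sum(w * r for w, r in zip(weights[i], rewards)) / total
        rewards = updated
    return rewards


# Toy usage: two clusters of rollouts, one labeled rollout per cluster.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
estimates = propagate_rewards(embeddings, {0: 1.0, 2: 0.0})
```

In this toy example, the unlabeled rollout at index 1 sits next to the labeled rollout with reward 1.0, and the one at index 3 sits next to the labeled rollout with reward 0.0, so each estimate is pulled toward its labeled neighbor. A dense graph is used only for brevity; a k-nearest-neighbor graph would be the usual choice at scale.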
Key facts
- MemReward is a graph-based experience memory framework for LLM reward prediction.
- It targets reinforcement learning settings where ground-truth labels are limited.
- The method propagates rewards from labeled to unlabeled rollouts.
- It is inspired by semi-supervised learning techniques.
- Target applications include mathematical proof evaluation and open-ended QA.
- MemReward integrates into online policy optimization.
- The paper is available on arXiv under ID 2603.19310.
- It aims to reduce reliance on expensive human annotation.
Entities
Institutions
- arXiv