Selective Eligibility Traces Improve RLVR for LLMs
Researchers have introduced Selective Eligibility Traces (S-trace), an approach that improves Reinforcement Learning with Verifiable Rewards (RLVR) for large language models. Existing critic-free algorithms such as GRPO assign credit uniformly across all tokens in a response, which limits their ability to identify the reasoning steps that actually matter. The paper first presents P-trace, a sample-efficient, critic-free eligibility-traces method, and then extends it with S-trace, which makes the traces sparse by selectively masking low-entropy tokens; this reduces variance and enables more precise credit assignment. The work is situated within recent developments in Group Sequence Policy Optimization (GSPO) and is detailed in arXiv paper 2605.05965.
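The core idea of sparse, entropy-gated credit assignment can be sketched in a few lines. The function names, the exponential decay scheme, and the entropy threshold below are illustrative assumptions for the sketch, not the paper's actual algorithm:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def selective_credits(entropies, reward, lam=0.9, threshold=0.5):
    """Assign per-token credit for a terminal verifiable reward.

    Tokens whose predictive entropy falls below `threshold` are masked
    out of the trace (credit 0); the remaining high-entropy tokens get
    exponentially decaying credit (decay factor `lam`) based on their
    distance from the end of the sequence. This is a hypothetical
    sketch of the selective-trace idea, not the paper's formulation.
    """
    credits = [0.0] * len(entropies)
    decay = 1.0
    for t in range(len(entropies) - 1, -1, -1):  # walk back from the reward
        if entropies[t] >= threshold:            # keep high-entropy tokens
            credits[t] = decay * reward
        decay *= lam
    return credits

# Uniform assignment (as in GRPO) would give every token the same credit;
# a selective trace concentrates it on high-entropy decision points.
entropies = [0.1, 1.2, 0.05, 0.9]
print(selective_credits(entropies, reward=1.0, lam=0.5))
# -> [0.0, 0.25, 0.0, 1.0]
```

Here only the two high-entropy tokens receive credit, with the token nearest the verified outcome credited most; the low-entropy tokens are excluded from the update entirely, which is what reduces variance.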
Key facts
- arXiv paper 2605.05965 proposes Selective Eligibility Traces (S-trace) for RLVR.
- S-trace addresses the uniform-credit-assignment limitation of GRPO.
- P-trace is introduced as a sample-efficient, critic-free eligibility-traces method.
- S-trace implements sparse eligibility traces by masking low-entropy tokens.
- The method aims to improve reasoning abilities of large language models.
- The paper contextualizes S-trace within recent GSPO work.
Entities
Institutions
- arXiv