Selective Eligibility Traces Improve RLVR for LLMs
Researchers have introduced Selective Eligibility Traces (S-trace), an approach that improves Reinforcement Learning with Verifiable Rewards (RLVR) for large language models. Existing critic-free algorithms such as GRPO assign credit uniformly across all tokens in a response, which limits their ability to identify the reasoning steps that actually matter. The paper first presents P-trace, a sample-efficient, critic-free eligibility-traces method, and then extends it with S-trace, which makes the traces sparse by selectively masking low-entropy tokens; this reduces variance and enables more precise credit assignment. The work is situated within recent developments in Group Sequence Policy Optimization (GSPO) and is detailed in arXiv paper 2605.05965.
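The core idea of sparse, entropy-gated credit assignment can be sketched in a few lines. The function names, the exponential decay scheme, and the entropy threshold below are illustrative assumptions for the sketch, not the paper's actual algorithm:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def selective_credits(entropies, reward, lam=0.9, threshold=0.5):
    """Assign per-token credit for a terminal verifiable reward.

    Tokens whose predictive entropy falls below `threshold` are masked
    out of the trace (credit 0); the remaining high-entropy tokens get
    exponentially decaying credit (decay factor `lam`) based on their
    distance from the end of the sequence. This is a hypothetical
    sketch of the selective-trace idea, not the paper's formulation.
    """
    credits = [0.0] * len(entropies)
    decay = 1.0
    for t in range(len(entropies) - 1, -1, -1):  # walk back from the reward
        if entropies[t] >= threshold:            # keep high-entropy tokens
            credits[t] = decay * reward
        decay *= lam
    return credits

# Uniform assignment (as in GRPO) would give every token the same credit;
# a selective trace concentrates it on high-entropy decision points.
entropies = [0.1, 1.2, 0.05, 0.9]
print(selective_credits(entropies, reward=1.0, lam=0.5))
# -> [0.0, 0.25, 0.0, 1.0]
```

Here only the two high-entropy tokens receive credit, with the token nearest the verified outcome credited most; the low-entropy tokens are excluded from the update entirely, which is what reduces variance.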
Key facts
- arXiv paper 2605.05965 proposes Selective Eligibility Traces (S-trace) for RLVR.
- S-trace addresses the uniform-credit-assignment limitation of GRPO.
- P-trace is introduced as a sample-efficient, critic-free eligibility-traces method.
- S-trace implements sparse eligibility traces by masking low-entropy tokens.
- The method aims to improve reasoning abilities of large language models.
- The paper contextualizes S-trace within recent GSPO work.
Entities
Institutions
- arXiv