Value-Gradient Hypothesis Explains RL Success in LLMs
A recent arXiv paper (2605.21654) proposes the value-gradient hypothesis to explain why critic-free reinforcement learning methods such as PPO and GRPO are effective at improving pretrained language models. The authors show that, under a differentiable rollout with an additive-noise parameterization, the actor update in critic-free RL is value-gradient-like in expectation. For discrete transformer policies, automatic differentiation through attention produces empirical costates that approximate this value signal, with the error controlled by the sampling gap and the policy entropy. The paper decomposes the impact of RL into two factors, the value-gradient signal and the reachable reward headroom, which gives a framework for identifying where RL is best applied in post-training.
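The phrase "value-gradient-like in expectation" can be made concrete with a toy example. The sketch below is not the paper's construction: it assumes a one-step problem with a scalar Gaussian policy, and the names `reward`, `reward_grad`, `estimate_gradients`, `mu`, `sigma`, and `target` are all hypothetical. It compares a critic-free score-function estimate (mean-reward baseline, no learned value function) with a pathwise estimate obtained by differentiating the reward through an additive-noise action.

```python
import random


def reward(action, target=1.0):
    """Toy differentiable reward, maximized at action == target (an assumption)."""
    return -(action - target) ** 2


def reward_grad(action, target=1.0):
    """Analytic derivative of the toy reward with respect to the action."""
    return -2.0 * (action - target)


def estimate_gradients(mu=0.0, sigma=0.5, n=200_000, seed=0):
    """Return (score_function, pathwise) estimates of dJ/dmu, where J = E[reward]."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Additive-noise parameterization: action = mean + sigma * noise.
    actions = [mu + sigma * e for e in noise]
    rewards = [reward(a) for a in actions]

    # Critic-free estimate: no learned value function, only a mean-reward baseline.
    # d/dmu log N(a; mu, sigma^2) = (a - mu) / sigma^2 = noise / sigma.
    baseline = sum(rewards) / n
    score_function = sum(
        (r - baseline) * e / sigma for r, e in zip(rewards, noise)
    ) / n

    # Pathwise estimate: differentiate the reward through the sampled action
    # (da/dmu = 1 under the additive-noise parameterization).
    pathwise = sum(reward_grad(a) for a in actions) / n
    return score_function, pathwise


score_function, pathwise = estimate_gradients()
exact = -2.0 * (0.0 - 1.0)  # closed-form dJ/dmu at mu=0, target=1
print(f"score-function: {score_function:.3f}  pathwise: {pathwise:.3f}  exact: {exact:.1f}")
```

In this toy setting both estimators should land close to the closed-form gradient, with the score-function estimate noticeably noisier per sample. The example covers only a continuous one-step case; it says nothing about the discrete transformer setting, where the summary states that the match is approximate and depends on the sampling gap and policy entropy.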
Key facts
- Paper is on arXiv with ID 2605.21654
- Proposes value-gradient hypothesis for critic-free RL in LLMs
- Covers PPO and GRPO methods
- Uses differentiable rollout and additive-noise parameterization
- Shows actor update is value-gradient-like in expectation
- Automatic differentiation through attention produces empirical costates
- Error in costates controlled by sampling gap and policy entropy
- Decomposes RL impact into value gradient signal and reachable reward headroom
Entities
Institutions
- arXiv