Value-Gradient Hypothesis Explains RL Success in LLMs

other · 2026-05-23

A recent arXiv preprint (2605.21654) proposes the value-gradient hypothesis to explain why critic-free reinforcement learning methods, such as PPO and GRPO, are effective at improving pretrained language models. The authors show that, under a differentiable rollout with an additive-noise parameterization, the actor update in critic-free RL estimates a value gradient in expectation. For discrete transformer policies, automatic differentiation through attention produces empirical costates that approximate this value signal, with the error controlled by the sampling gap and policy entropy. The work decomposes the impact of RL into a value-gradient signal and reachable reward headroom, providing a framework for identifying when RL post-training is likely to be most effective.
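The "value gradient in expectation" claim can be made concrete with a toy case. The sketch below is not the paper's construction: it assumes a one-step problem with a scalar action a = mu + sigma * eps, Gaussian noise, and a hypothetical quadratic reward, all chosen only for illustration. Under those assumptions, the critic-free score-function (REINFORCE-style) estimate of the gradient with respect to mu and the pathwise estimate obtained by differentiating through the reward agree in expectation, which is a standard property of Gaussian additive noise.

```python
import random

# Hypothetical toy reward and its derivative (illustrative only, not from the paper).
def reward(a, target=2.0):
    return -(a - target) ** 2

def reward_grad(a, target=2.0):
    return -2.0 * (a - target)

def estimate_gradients(mu=0.5, sigma=0.3, n=200_000, seed=0):
    """Estimate d/dmu E[reward(mu + sigma * eps)] in two ways."""
    rng = random.Random(seed)
    # Additive-noise parameterization: action = mean + sigma * standard normal noise.
    actions = [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]
    rewards = [reward(a) for a in actions]

    # Critic-free score-function estimate: uses only sampled rewards,
    # with the batch mean reward as a baseline to reduce variance.
    baseline = sum(rewards) / n
    score_fn = sum(
        (r - baseline) * (a - mu) / sigma**2 for a, r in zip(actions, rewards)
    ) / n

    # Pathwise (value-gradient-style) estimate: differentiates through the reward.
    pathwise = sum(reward_grad(a) for a in actions) / n
    return score_fn, pathwise

score_fn, pathwise = estimate_gradients()
analytic = -2.0 * (0.5 - 2.0)  # exact gradient for this quadratic reward
print(f"score-function: {score_fn:.3f}  pathwise: {pathwise:.3f}  analytic: {analytic:.3f}")
```

Both estimates should land near the analytic gradient, with the score-function estimate noticeably noisier. The multi-step, discrete-token setting described in the summary is considerably harder and is not captured by this sketch.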

Key facts

Paper is on arXiv with ID 2605.21654
Proposes value-gradient hypothesis for critic-free RL in LLMs
Covers PPO and GRPO methods
Uses differentiable rollout and additive-noise parameterization
Shows actor update is value-gradient-like in expectation
Automatic differentiation through attention produces empirical costates
Error in costates controlled by sampling gap and policy entropy
Decomposes RL impact into value gradient signal and reachable reward headroom
