R³L: Reflect-then-Retry Reinforcement Learning for LLM Reasoning
R³L (Reflect-then-Retry Reinforcement Learning) is a proposed reinforcement learning method for improving LLM reasoning and agentic capabilities. It targets two problems: on the exploration side, low success rates on difficult tasks and costly rollouts; on the exploitation side, coarse credit assignment from trajectory-level rewards. Rather than relying on stochastic sampling alone, R³L actively synthesizes trajectories: it uses language feedback to diagnose errors and turn failed attempts into successful ones, which also reduces rollout costs. Pivotal credit assignment and positive amplification are introduced to stabilize training and strengthen learning from positive signals.
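The reflect-then-retry idea can be illustrated with a small toy loop. This is only a sketch under stated assumptions, not the paper's implementation: the `Reflection` type, the `rollout`, `verify`, and `reflect` callables, and the choice to resume from the first step flagged as wrong are all hypothetical stand-ins for the policy model, the reward check, and the language feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Trajectory = List[str]  # a trajectory is modeled as a list of step labels


@dataclass
class Reflection:
    failure_index: int  # first step judged to be wrong (hypothetical field)
    hint: str           # natural-language diagnosis (hypothetical field)


def reflect_then_retry(
    rollout: Callable[[Trajectory, Optional[str]], Trajectory],
    verify: Callable[[Trajectory], bool],
    reflect: Callable[[Trajectory], Reflection],
    max_retries: int = 2,
) -> Tuple[Trajectory, bool, int]:
    """Return (trajectory, success, total number of steps generated)."""
    trajectory = rollout([], None)  # first attempt: plain sampling
    generated = len(trajectory)
    for _ in range(max_retries):
        if verify(trajectory):
            break
        reflection = reflect(trajectory)                 # diagnose the failure
        prefix = trajectory[: reflection.failure_index]  # keep the steps before it
        trajectory = rollout(prefix, reflection.hint)    # retry from that point
        generated += len(trajectory) - len(prefix)       # only new steps cost anything
    return trajectory, verify(trajectory), generated


# --- Toy stand-ins (assumptions for illustration only) ---
TARGET = ["parse", "plan", "compute", "answer"]
FLAWED = ["parse", "plan", "guess", "answer"]


def toy_rollout(prefix: Trajectory, hint: Optional[str]) -> Trajectory:
    # Without a hint the toy policy goes wrong at step 2; with one it recovers.
    source = TARGET if hint else FLAWED
    return list(prefix) + source[len(prefix):]


def toy_verify(trajectory: Trajectory) -> bool:
    return trajectory == TARGET


def toy_reflect(trajectory: Trajectory) -> Reflection:
    idx = next(i for i, (s, t) in enumerate(zip(trajectory, TARGET)) if s != t)
    return Reflection(failure_index=idx, hint=f"step {idx} should not be '{trajectory[idx]}'")


result = reflect_then_retry(toy_rollout, toy_verify, toy_reflect)
print(result)
```

In this toy run the retry regenerates only the two steps after the diagnosed error, so six steps are generated in total instead of the eight that two full rollouts would need.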
Key facts
- R³L is short for Reflect-then-Retry Reinforcement Learning; the full name continues "with Language-Guided Exploration, Pivotal Credit, and Positive Amplification."
- The method uses language feedback to diagnose errors and convert failed attempts into successful ones.
- It reduces rollout costs by retrying from a failed trajectory rather than generating a new rollout from scratch.
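- R³L addresses the coarse credit assignment of trajectory-level rewards through pivotal credit assignment, which focuses on pivotal steps, and uses positive amplification to strengthen learning from positive signals.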
- The approach aims to improve both exploration and exploitation in LLM reasoning tasks.
- The paper is available on arXiv with ID 2601.03715.
- The method shifts from stochastic sampling to active synthesis of high-quality trajectories.
- R³L is designed to stabilize training in failure-dominated groups.
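A rough way to picture pivotal credit and positive amplification together is as per-step weights on a group of trajectories. The sketch below is an illustration under assumptions, not the paper's formulation: the group-normalized advantage, the amplification factor `alpha`, and the rule that steps before a hypothetical `pivot_index` receive zero weight are all choices made here for the example.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # Normalize rewards within one group of trajectories for the same task.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def amplify_positive(advantages: List[float], alpha: float = 2.0) -> List[float]:
    # Assumed rule: scale up positive advantages so that a few successes
    # are not drowned out when most of the group failed.
    return [a * alpha if a > 0 else a for a in advantages]


def pivotal_mask(num_steps: int, pivot_index: int) -> List[float]:
    # Assumed rule: only steps from the pivot onward receive credit.
    return [0.0 if i < pivot_index else 1.0 for i in range(num_steps)]


def step_weights(
    rewards: List[float], pivots: List[int], num_steps: int, alpha: float = 2.0
) -> List[List[float]]:
    advantages = amplify_positive(group_advantages(rewards), alpha)
    return [
        [a * m for m in pivotal_mask(num_steps, p)]
        for a, p in zip(advantages, pivots)
    ]


# A failure-dominated group: three failures, one success.
rewards = [0.0, 0.0, 0.0, 1.0]
pivots = [2, 1, 3, 2]  # hypothetical pivot step for each trajectory
weights = step_weights(rewards, pivots, num_steps=4)
for row in weights:
    print([round(w, 3) for w in row])
```

Steps before each pivot get zero weight, so a valid prefix is neither rewarded nor penalized for what happens later, and the single successful trajectory carries a larger per-step weight than any individual failure.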
Entities
Institutions
- arXiv