Adaptive Negative Reinforcement Boosts LLM Reasoning
A new arXiv preprint (2605.07137) introduces Adaptive Negative Sample Reinforcement (A-NSR), an extension of Negative Sample Reinforcement (NSR) for improving reasoning in Large Language Models (LLMs). NSR penalizes incorrect reasoning steps rather than rewarding correct ones, yet matches or exceeds more complex frameworks such as PPO and GRPO across the Pass@k spectrum. Current NSR, however, applies a fixed penalty uniformly throughout training. A-NSR replaces this with time-dependent scheduling: early training emphasizes error correction to stabilize the model, while later training shifts to subtler adjustments. The paper frames this as dynamically balancing correction and diversity in reinforcement learning with verifiable rewards (RLVR).
Key facts
- arXiv paper 2605.07137 introduces A-NSR
- A-NSR extends Negative Sample Reinforcement (NSR)
- NSR penalizes incorrect steps, not just rewarding correct ones
- NSR matches or exceeds PPO and GRPO across Pass@k
- Current NSR uses fixed penalty throughout training
- A-NSR uses time-dependent scheduling functions
- Early training focuses on error correction
- Later training shifts to subtle adjustments
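The scheduling idea above can be sketched in a few lines of Python. The schedule shape (cosine decay), the weight values, and all function names below are illustrative assumptions, not details from the paper:

```python
import math

def nsr_penalty_weight(step, total_steps, w_start=1.0, w_end=0.1):
    """Hypothetical time-dependent penalty schedule for A-NSR.

    Early in training the penalty on incorrect samples is strong
    (error correction); it then decays toward a gentler weight so
    later updates make only subtle adjustments. Cosine decay is an
    assumption; the paper's actual schedule may differ.
    """
    progress = min(step / total_steps, 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1 + math.cos(math.pi * progress))

def nsr_reward(is_correct, step, total_steps):
    """NSR-style signal: incorrect samples are penalized, while
    correct samples receive no positive reward (penalty-only)."""
    if is_correct:
        return 0.0
    return -nsr_penalty_weight(step, total_steps)
```

With this sketch, a wrong answer at step 0 of 1000 is penalized at full strength (-1.0), while the same mistake near the end of training draws only a mild penalty (about -0.1), matching the stabilize-then-refine progression described above.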
Entities
Institutions
- arXiv