Adaptive Negative Reinforcement Boosts LLM Reasoning
A new arXiv preprint (2605.07137) introduces Adaptive Negative Sample Reinforcement (A-NSR), an extension of Negative Sample Reinforcement (NSR) for improving reasoning in Large Language Models (LLMs). NSR penalizes incorrect reasoning steps rather than rewarding correct ones, yet matches or exceeds more complex frameworks such as PPO and GRPO across the Pass@k spectrum. Current NSR, however, applies a fixed penalty uniformly throughout training. A-NSR replaces this with time-dependent scheduling: early training emphasizes error correction to stabilize the model, while later training shifts to subtler adjustments. The paper frames this as dynamically balancing correction and diversity in reinforcement learning with verifiable rewards (RLVR).
Key facts
- arXiv paper 2605.07137 introduces A-NSR
- A-NSR extends Negative Sample Reinforcement (NSR)
- NSR penalizes incorrect steps, not just rewarding correct ones
- NSR matches or exceeds PPO and GRPO across Pass@k
- Current NSR uses fixed penalty throughout training
- A-NSR uses time-dependent scheduling functions
- Early training focuses on error correction
- Later training shifts to subtle adjustments
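The scheduling idea above can be sketched in a few lines of Python. The schedule shape (cosine decay), the weight values, and all function names below are illustrative assumptions, not details from the paper:

```python
import math

def nsr_penalty_weight(step, total_steps, w_start=1.0, w_end=0.1):
    """Hypothetical time-dependent penalty schedule for A-NSR.

    Early in training the penalty on incorrect samples is strong
    (error correction); it then decays toward a gentler weight so
    later updates make only subtle adjustments. Cosine decay is an
    assumption; the paper's actual schedule may differ.
    """
    progress = min(step / total_steps, 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1 + math.cos(math.pi * progress))

def nsr_reward(is_correct, step, total_steps):
    """NSR-style signal: incorrect samples are penalized, while
    correct samples receive no positive reward (penalty-only)."""
    if is_correct:
        return 0.0
    return -nsr_penalty_weight(step, total_steps)
```

With this sketch, a wrong answer at step 0 of 1000 is penalized at full strength (-1.0), while the same mistake near the end of training draws only a mild penalty (about -0.1), matching the stabilize-then-refine progression described above.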
Entities
Institutions
- arXiv