Sample Difficulty's Non-Monotonic Role in RLVR for LLMs

ai-technology · 2026-05-28

A new arXiv preprint (2605.28388) investigates the mechanistic role of sample difficulty in Reinforcement Learning with Verifiable Reward (RLVR) for large language models (LLMs), focusing on mathematics and programming tasks. The study finds that sample difficulty has a non-monotonic effect. Easy and medium-difficulty problems yield the strongest reasoning improvements. Overly hard problems provide weak learning signals, induce degenerate behaviors such as answer repetition or skipping necessary computation, and can degrade pre-existing capabilities. Using Temporal Sparse Autoencoders (T-SAE) to analyze internal feature dynamics, the authors find that easy problems reinforce direct-answer and basic computation pathways.
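One way to see why overly hard problems can give a weak learning signal is a small sketch. It assumes a group-normalized advantage estimator over binary verifier rewards, which is common in RLVR setups but is not confirmed by this summary as the preprint's training method. The function names, the group size of 8, and the solve rates are hypothetical and chosen only for illustration. Under that assumption, a group of sampled answers that all fail gets zero advantage for every sample, so it contributes no gradient.

```python
from statistics import pstdev


def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std.

    Assumed estimator for illustration; returns all zeros when every
    reward in the group is identical (std == 0), i.e. no learning signal.
    """
    mean = sum(rewards) / len(rewards)
    std = pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def prob_informative_group(solve_rate, group_size):
    """Probability that a group of independent binary rewards is mixed.

    A group is uninformative when all samples fail or all samples pass,
    which happens with probability (1 - p)^G + p^G.
    """
    return 1.0 - solve_rate ** group_size - (1.0 - solve_rate) ** group_size


if __name__ == "__main__":
    # Hypothetical solve rates: a very hard, a hard, and a medium problem.
    for p in (0.01, 0.1, 0.5):
        print(f"solve rate {p}: P(informative group) = "
              f"{prob_informative_group(p, 8):.3f}")

    # A group where every sampled answer is wrong carries no signal.
    print(group_advantages([0, 0, 0, 0]))
    # A mixed group rewards the one correct sample relative to the rest.
    print(group_advantages([1, 0, 0, 0]))
```

This sketch only addresses the weak-signal part of the finding. The same formula also vanishes as the solve rate approaches 1, so it does not explain why easy problems remain useful, nor the degenerate behaviors or capability degradation reported for hard problems.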

Key facts

Study examines RLVR for LLMs
Sample difficulty has non-monotonic effect
Easy and medium problems yield strongest improvements
Overly hard problems cause weak signals and degenerate behaviors
Degenerate behaviors include answer repetition and skipping computation
Hard problems can degrade pre-existing capabilities
Temporal Sparse Autoencoders (T-SAE) used for internal analysis
Focus on mathematics and programming tasks
