Sample Difficulty's Non-Monotonic Role in RLVR for LLMs
A new arXiv preprint (2605.28388) investigates the mechanistic role of sample difficulty in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs). The study finds that sample difficulty has a non-monotonic effect. Easy and medium-difficulty problems yield the strongest reasoning improvements. Overly hard problems provide weak learning signals, induce degenerate behaviors such as answer repetition and skipping necessary computation, and can degrade pre-existing capabilities. Using Temporal Sparse Autoencoders (T-SAE), the authors analyze internal feature dynamics and find that easy problems reinforce direct-answer and basic-computation pathways. The research focuses on mathematics and programming tasks.
Key facts
- Study examines RLVR for LLMs
- Sample difficulty has non-monotonic effect
- Easy and medium problems yield strongest improvements
- Overly hard problems yield weak learning signals and induce degenerate behaviors
- Degenerate behaviors include answer repetition and skipping computation
- Hard problems can degrade pre-existing capabilities
- Temporal Sparse Autoencoders (T-SAE) used for internal analysis
- Focus on mathematics and programming tasks
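One intuition for why overly hard problems give weak learning signals can be sketched numerically. This is an illustration under stated assumptions, not the preprint's analysis: it assumes a binary verifiable reward (1 if the answer checks out, 0 otherwise), a fixed number of independent rollouts per problem, and a per-rollout success probability. Under those assumptions, a problem contributes any nonzero reward only if at least one rollout succeeds, which happens with probability 1 - (1 - p)^G. The function name, the group size of 8, and the pass rates labeled easy, medium, and hard are all hypothetical values chosen for illustration.

```python
def prob_any_success(pass_rate: float, group_size: int) -> float:
    """Probability that at least one of `group_size` independent rollouts
    earns the binary reward, given a per-rollout success probability.

    Assumption (not from the preprint): rollouts are independent and the
    verifier returns exactly 0 or 1.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Complement of "every rollout fails".
    return 1.0 - (1.0 - pass_rate) ** group_size


if __name__ == "__main__":
    # Hypothetical pass rates, chosen only to contrast difficulty levels.
    examples = [("easy", 0.8), ("medium", 0.4), ("hard", 0.01)]
    for label, p in examples:
        chance = prob_any_success(p, group_size=8)
        print(f"{label:>6}: pass_rate={p:.2f} -> P(any reward)={chance:.3f}")
```

With a very low pass rate, most sampled groups contain no rewarded rollout at all, so the problem rarely contributes a useful update. The sketch covers only this sparse-reward effect; it does not model the degenerate behaviors or capability degradation the study reports.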
Entities
Institutions
- arXiv