ARTFEED — Contemporary Art Intelligence

Study Reveals Scaling Laws for LLM Reinforcement Learning in Mathematical Reasoning

ai-technology · 2026-04-20

An empirical study systematically investigates how large language models (LLMs) behave under reinforcement learning post-training, with a focus on mathematical reasoning. The research analyzes the scaling characteristics of the Qwen2.5 dense model series, which spans 0.5B to 72B parameters. Among its findings: larger models learn more efficiently with respect to both compute and data; test loss follows a strong, predictive power-law relationship with compute and data; and this relationship holds for base and instruction-tuned models alike, although the study notes that the efficiency trends of larger models call for further analysis. The work characterizes how model size, data volume, and computational budget interact to shape performance, addressing a gap in the literature: scaling laws for post-training reinforcement learning have received far less attention than those for pre-training. The paper is cataloged as arXiv:2509.25300v4 and classified as a replace-cross announcement.
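The summary does not reproduce the paper's exact functional form. Scaling-law studies of this kind typically fit a saturating power law, which under that assumption would read:

    L(C) = a \cdot C^{-b} + L_{\infty}

where L is test loss, C is training compute (an analogous form applies to data volume D), a and b are fitted constants, and L_infinity is the irreducible loss floor. This form is an illustrative assumption, not a formula quoted from the paper.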

Key facts

  • The study investigates scaling behaviors of large language models under reinforcement learning post-training
  • Focus is specifically on mathematical reasoning applications
  • Research covers the full Qwen2.5 dense model series from 0.5B to 72B parameters
  • Larger models consistently show superior learning efficiency on compute and data metrics
  • A predictive power-law relationship between test loss, compute, and data is identified (see the fitting sketch after this list)
  • The power-law relationship is robust across both base and instruction-tuned models
  • The paper is published as arXiv:2509.25300v4 with announcement type replace-cross
  • The study examines interactions between model scale, data volume, and computational budget
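For readers who want to see what fitting such a relationship looks like in practice, here is a minimal sketch in Python, assuming the saturating power law sketched above. The sample (compute, loss) pairs, the parameter names, and the starting guesses are all hypothetical, invented for illustration; they are not data or code from the paper.

    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(compute, a, b, loss_floor):
        # Saturating power law: test loss decays as compute^-b
        # toward an irreducible floor (the generic form above).
        return a * np.power(compute, -b) + loss_floor

    # Hypothetical observations: RL training compute (in units of
    # 1e18 FLOPs) versus test loss. Numbers invented for illustration.
    compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
    loss = np.array([1.30, 1.11, 0.97, 0.88, 0.82])

    # Fit the three free parameters; p0 supplies rough starting guesses.
    (a, b, loss_floor), _ = curve_fit(power_law, compute, loss,
                                      p0=[0.5, 0.3, 0.8])
    print(f"a={a:.3f}, b={b:.3f}, irreducible loss={loss_floor:.3f}")

In this framing, a larger exponent b or a lower floor corresponds to faster, cheaper learning; comparing such fitted curves across the 0.5B to 72B series is one plausible way the efficiency claims above could be quantified.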
