Smaller LLMs Boost Policy Diversity in GRPO Training

ai-technology · 2026-06-01

A recent arXiv paper (2605.30789) reports that, during Group Relative Policy Optimization (GRPO), smaller language models exhibit greater policy-level diversity than larger models from the same family. This diversity is temporally correlated and preserves logical consistency, so it provides structured exploration signals for gradient estimation, whereas token-level randomness can introduce step-wise noise. The researchers introduce S2L-PO (Small-to-Large Policy Optimization), a framework that uses fixed smaller models as effective explorers to improve the training of larger models, with a progressive annealing strategy that balances exploration and exploitation.
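To make the annealing idea concrete, here is a minimal Python sketch of one way the share of rollouts drawn from a small explorer model could shrink over training. The summary does not give the paper's actual schedule, so the linear decay, the starting share of 0.5, and the names `small_model_fraction` and `split_group` are all assumptions made for illustration, not the authors' method.

```python
def small_model_fraction(step, total_steps, start=0.5, end=0.0):
    """Share of each rollout group sampled from the small explorer.

    Assumption: a linear decay from `start` to `end`. The real S2L-PO
    schedule may differ; this only illustrates progressive annealing.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def split_group(group_size, fraction):
    """Return (rollouts from small model, rollouts from large model)."""
    n_small = round(group_size * fraction)
    return n_small, group_size - n_small


if __name__ == "__main__":
    # Early steps lean on the small model, late steps on the large one.
    for step in (0, 50, 100):
        frac = small_model_fraction(step, total_steps=100)
        print(step, split_group(8, frac))
```

Under these assumptions, early training favors exploration through the small model's rollouts, and later training shifts toward exploitation with the large model's own samples.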

Key facts

Smaller models show higher policy-level diversity in GRPO
Diversity is temporally correlated and preserves logical consistency
Token-level randomness can introduce step-wise noise
S2L-PO framework uses small models as explorers for larger models
Progressive annealing strategy balances exploration and exploitation
Paper published on arXiv with ID 2605.30789
Announce type: cross
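
For context on why rollout diversity matters for gradient estimation, the sketch below shows the group-relative advantage commonly used in GRPO: each reward is normalized against the mean and standard deviation of its own group. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own rollout group.

    Assumption: population standard deviation plus a small `eps`
    to avoid division by zero; implementations vary on both points.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # A diverse group yields non-zero advantages (a usable signal).
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # A group with identical rewards yields all zeros (no signal).
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

When every rollout in a group earns the same reward, all advantages collapse to zero and the group contributes no gradient, which is one reason more varied rollouts can help.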
