Smaller LLMs Boost Policy Diversity in GRPO Training
A recent arXiv paper (2605.30789) reports that, during Group Relative Policy Optimization (GRPO), smaller language models show greater policy-level diversity than larger models from the same family. This diversity is temporally correlated and preserves logical consistency, so it provides structured exploration signals for gradient estimation, whereas token-level randomness can introduce step-wise noise. The researchers introduce S2L-PO (Small-to-Large Policy Optimization), a framework that uses fixed smaller models as explorers to improve the training of larger models, with a progressive annealing strategy to balance exploration and exploitation.
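For readers unfamiliar with GRPO, the sketch below shows the standard group-relative advantage: each sampled completion's reward is normalized against the mean and standard deviation of its own group. It is a minimal illustration rather than code from the paper; the function name, the `eps` constant, and the choice of population standard deviation are assumptions. It also hints at why rollout diversity matters: a group whose completions all receive the same reward yields zero advantages and therefore no learning signal.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    Assumption: population std and a small eps for numerical stability;
    real implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled completions scored 1.0 (correct) or 0.0 (incorrect).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group with identical outcomes: every advantage collapses to zero.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In the mixed group, correct completions receive positive advantages and incorrect ones negative; in the uniform group, nothing distinguishes the samples.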
Key facts
- Smaller models show higher policy-level diversity in GRPO
- Diversity is temporally correlated and preserves logical consistency
- Token-level randomness can introduce step-wise noise
- S2L-PO framework uses small models as explorers for larger models
- Progressive annealing strategy balances exploration and exploitation
- Paper published on arXiv with ID 2605.30789
- arXiv announce type: cross (a cross-listing)
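The summary does not specify how S2L-PO schedules its annealing, so the following is only a hypothetical sketch of one way such a schedule could look: the share of each rollout group drawn from the small explorer model decays linearly over training. The function names, the linear shape, and the 0.5 starting fraction are all assumptions made for illustration, not details from the paper, and the sketch leaves out any correction for the explorer's samples being off-policy.

```python
def explorer_fraction(step, total_steps, start=0.5, end=0.0):
    """Linearly anneal the share of rollouts drawn from the small model.

    Hypothetical schedule: the start/end values and linear shape are
    illustrative assumptions, not taken from the paper.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress

def split_group(group_size, step, total_steps):
    """Return (n_small, n_large) rollouts for one prompt at this step."""
    n_small = round(group_size * explorer_fraction(step, total_steps))
    return n_small, group_size - n_small

# Early training leans on the small explorer; late training relies on
# the large policy's own samples.
early = split_group(8, step=0, total_steps=100)
middle = split_group(8, step=50, total_steps=100)
late = split_group(8, step=100, total_steps=100)
```

Under these assumed settings, a group of eight rollouts starts evenly split between the two models and ends with all samples coming from the large policy.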
Entities
Institutions
- arXiv