GPS: A Lightweight Model for Efficient RL Post-Training of LLMs
Researchers introduce Generalizable Predictive Prompt Selection (GPS), a method for making reinforcement learning (RL) post-training of large reasoning models more efficient. GPS uses a lightweight generative model to predict prompt difficulty through Bayesian inference over the shared optimization history, which enables online prompt selection without costly exact evaluations. Batch acquisition prioritizes intermediate-difficulty prompts and adds a history-anchored diversity criterion. In experiments across varied reasoning tasks, GPS also generalizes at test time, reducing computational cost while maintaining performance. The paper is available on arXiv (2602.01970).
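To make the idea of predicting difficulty from history concrete, here is a minimal sketch. It is not the GPS model: it swaps in a plain Beta-Bernoulli posterior per prompt (an assumption chosen for brevity) where the paper uses a lightweight generative model, and the names `BetaPosterior`, `intermediate_score`, and `select_prompts` are hypothetical. The 0.5 target reflects a common rationale for intermediate-difficulty prioritization: prompts that are always solved or never solved give little learning signal.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Beta(alpha, beta) belief over one prompt's success rate (illustrative stand-in)."""
    alpha: float = 1.0  # prior pseudo-count of successes (uniform prior, an assumption)
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, successes: int, failures: int) -> None:
        # Conjugate update: add observed rollout outcomes to the pseudo-counts.
        self.alpha += successes
        self.beta += failures

    @property
    def mean(self) -> float:
        # Posterior mean of the success rate.
        return self.alpha / (self.alpha + self.beta)


def intermediate_score(p: float, target: float = 0.5) -> float:
    """Higher when the predicted success rate is closer to the target."""
    return -abs(p - target)


def select_prompts(posteriors: dict[str, BetaPosterior], k: int) -> list[str]:
    """Return the k prompts whose predicted success rate is closest to 0.5."""
    ranked = sorted(
        posteriors,
        key=lambda name: intermediate_score(posteriors[name].mean),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical rollout history: prompt name -> (successes, failures).
history = {"easy": (7, 1), "medium": (4, 4), "hard": (0, 8)}

posteriors = {}
for name, (wins, losses) in history.items():
    post = BetaPosterior()
    post.update(wins, losses)
    posteriors[name] = post

chosen = select_prompts(posteriors, k=1)
print(chosen)  # ['medium']
```

Because the score is computed from the posterior rather than from fresh rollouts, selection needs no additional evaluation of the candidate prompts, which is the cost the summary says GPS avoids.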
Key facts
- GPS performs Bayesian inference on prompt difficulty using a lightweight generative model.
- It uses intermediate-difficulty prioritization and history-anchored diversity for batch selection.
- The method generalizes at test time, supporting efficient allocation of compute.
- Experiments were conducted across varied reasoning tasks.
- The paper is available as arXiv:2602.01970.
- GPS aims to reduce high computational costs of rollout-intensive RL optimization.
- Existing methods depend on costly exact evaluations or fail to generalize across prompts.
- GPS is designed for online prompt selection in RL post-training of large reasoning models.
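The batch-selection facts above can be illustrated with a short greedy sketch. The summary does not define "history-anchored diversity", so this is only one possible reading, stated as an assumption: each candidate is penalized by its maximum cosine similarity to prompts already used (the history) and to prompts already picked for the current batch. The function names, the `diversity_weight` parameter, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_batch(candidates, history_embeddings, k, diversity_weight=0.5):
    """Greedily pick k prompts, trading off difficulty against redundancy.

    candidates: dict of name -> (predicted_success_rate, embedding)
    history_embeddings: embeddings of prompts used in earlier steps (the anchors)
    """
    anchors = list(history_embeddings)
    remaining = dict(candidates)
    batch = []

    def score(name):
        p, emb = remaining[name]
        # 1.0 at p = 0.5, falling linearly to 0.0 at p = 0 or p = 1.
        difficulty_term = 1.0 - 2.0 * abs(p - 0.5)
        # Similarity to the closest anchor; 0.0 when there are no anchors yet.
        redundancy = max((cosine(emb, a) for a in anchors), default=0.0)
        return difficulty_term - diversity_weight * redundancy

    while remaining and len(batch) < k:
        best = max(remaining, key=score)
        batch.append(best)
        # The chosen prompt becomes an anchor for the rest of the batch.
        anchors.append(remaining.pop(best)[1])
    return batch


# Toy data: "a" and "b" are near-duplicates of a prompt already in the history.
candidates = {
    "a": (0.5, [1.0, 0.0]),
    "b": (0.5, [0.9, 0.1]),
    "c": (0.4, [0.0, 1.0]),
}
history_embeddings = [[1.0, 0.0]]

with_diversity = select_batch(candidates, history_embeddings, k=2)
without_diversity = select_batch(candidates, history_embeddings, k=2, diversity_weight=0.0)
print(with_diversity)     # ['c', 'b']
print(without_diversity)  # ['a', 'b']
```

With the penalty switched off, the two near-duplicate prompts fill the batch; with it on, the slightly less ideal but novel prompt "c" is chosen first.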
Entities
Institutions
- arXiv