GPS: A Lightweight Model for Efficient RL Post-Training of LLMs

other · 2026-05-18

Researchers introduce Generalizable Predictive Prompt Selection (GPS), a method for making reinforcement learning (RL) post-training of large reasoning models more efficient. GPS uses a small generative model to predict prompt difficulty via Bayesian inference on shared optimization history, which enables online prompt selection without costly exact evaluations. The approach prioritizes intermediate-difficulty prompts and incorporates history-anchored diversity for batch acquisition. Experiments across varied reasoning tasks show that GPS also generalizes to test time, reducing computational costs while maintaining performance. The paper is available on arXiv (2602.01970).
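To make the idea of intermediate-difficulty prioritization concrete, here is a minimal sketch. It is not the paper's method: GPS uses a small generative model shared across prompts, whereas this example swaps in an independent Beta-Bernoulli posterior per prompt, which does not generalize to unseen prompts. The function names, the uniform prior, and the 0.5 target success rate are all assumptions made for illustration.

```python
# Illustrative only: a per-prompt Beta-Bernoulli posterior stands in for
# the paper's shared generative model. All names here are hypothetical.

def posterior_mean(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior mean success rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

def intermediate_score(p, target=0.5):
    """Score is highest when the predicted success rate p is near target."""
    return -abs(p - target)

def select_prompts(history, k, target=0.5):
    """Return the k prompt ids whose predicted success rate is closest to target.

    history maps prompt id -> (successes, failures) from earlier rollouts.
    """
    ranked = sorted(
        history,
        key=lambda pid: intermediate_score(posterior_mean(*history[pid]), target),
        reverse=True,
    )
    return ranked[:k]

# Rollout outcomes gathered so far (made-up numbers).
history = {
    "easy": (9, 1),     # almost always solved
    "medium": (5, 5),   # solved about half the time
    "hard": (0, 10),    # never solved
    "unseen": (0, 0),   # no rollouts yet, so the prior mean of 0.5 applies
}
selected = select_prompts(history, 2)
```

In this toy setup the prompts with a predicted success rate near 0.5 are chosen, and no new rollouts are needed to score them, which is the cost saving the summary describes.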

Key facts

GPS performs Bayesian inference on prompt difficulty using a lightweight generative model.
It uses intermediate-difficulty prioritization and history-anchored diversity for batch selection.
The method also generalizes to test time, where it supports efficient allocation of compute.
Experiments were conducted across varied reasoning tasks.
The paper is arXiv:2602.01970.
GPS aims to reduce the high computational cost of rollout-intensive RL optimization.
Existing methods either depend on costly exact evaluations or fail to generalize across prompts.
GPS is designed for online prompt selection in RL post-training of large reasoning models.
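The summary does not spell out how history-anchored diversity is computed, so the following is only one plausible reading, sketched under stated assumptions: each candidate is scored by closeness to 50% predicted success, minus a penalty for similarity to prompts selected earlier (the history) or already placed in the current batch. The embeddings, the cosine penalty, the `weight` parameter, and all function names are hypothetical and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_batch(candidates, recent, k, weight=0.5):
    """Greedily pick k prompts that are intermediate in difficulty and diverse.

    candidates maps prompt id -> (predicted_success_rate, embedding).
    recent is a list of embeddings of previously selected prompts.
    """
    anchors = list(recent)        # history plus picks made so far
    remaining = dict(candidates)
    batch = []
    while remaining and len(batch) < k:
        def score(pid):
            p, emb = remaining[pid]
            # Penalize similarity to the nearest anchor, if any exist.
            closeness = max((cosine(emb, a) for a in anchors), default=0.0)
            return -abs(p - 0.5) - weight * closeness
        best = max(remaining, key=score)
        batch.append(best)
        anchors.append(remaining.pop(best)[1])
    return batch

# Made-up candidates: "a" and "b" are near-duplicates, "c" is different.
candidates = {
    "a": (0.5, [1.0, 0.0]),
    "b": (0.5, [0.9, 0.1]),
    "c": (0.4, [0.0, 1.0]),
}
batch_no_history = select_batch(candidates, recent=[], k=2)
batch_with_history = select_batch(candidates, recent=[[0.0, 1.0]], k=2)
```

With an empty history the second pick skips the near-duplicate "b" in favor of "c"; once a prompt resembling "c" is already in the history, the penalty shifts and "b" is chosen instead.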
