LoPE: Prompt Perturbation Boosts LLM Reasoning in GRPO
Researchers propose Lorem Perturbation for Exploration (LoPE), a training framework that addresses the zero-advantage problem in Group Relative Policy Optimization (GRPO) for large language models. When all sampled rollouts for a query fail, every rollout receives the same reward, the group-relative advantages collapse to zero, and the update carries no training signal. LoPE introduces task-irrelevant prompt-space perturbations that shift the model's output distribution, enabling broader reasoning exploration without increasing the sampling budget. The method aims to improve success rates on complex reasoning tasks.
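The zero-advantage problem can be illustrated with a minimal sketch of GRPO-style group-relative advantages (the function name and normalization details here are illustrative, not taken from the paper):

```python
import statistics

def group_relative_advantages(rewards):
    """Compute group-relative advantages for one query's rollouts.

    Each rollout's advantage is its reward minus the group mean, scaled
    by the group's standard deviation. When every rollout gets the same
    reward (e.g. all fail with reward 0), the advantages are all zero
    and the policy update carries no signal -- the zero-advantage problem.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rewards identical: no relative preference, no gradient signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

For example, a group where all four rollouts fail (`[0, 0, 0, 0]`) yields all-zero advantages, while a group with even one success produces non-zero advantages that sum to zero across the group.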
Key facts
- GRPO suffers from zero-advantage problem when all rollouts fail
- LoPE uses prompt-space perturbations to unlock exploration
- LoPE is a simple yet effective training framework
- Task-irrelevant perturbations shift output distribution
- LoPE aims to improve success rates on complex tasks
- Method does not require increasing the sampling budget
- Paper published on arXiv with ID 2605.05566
- LoPE stands for Lorem Perturbation for Exploration
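A prompt-space perturbation of the kind described above can be sketched as follows. The perturbation strings and helper below are hypothetical; the summary does not specify how LoPE constructs its task-irrelevant perturbations:

```python
import random

# Hypothetical task-irrelevant perturbation strings (placeholders; the
# paper's actual perturbation construction is not given in this summary).
PERTURBATIONS = [
    "Note: the following line is unrelated to the task.",
    "Aside: lorem ipsum dolor sit amet.",
    "Reminder: ignore this sentence when answering.",
]

def perturb_prompt(prompt, rng=random):
    """Prepend a task-irrelevant string to the prompt.

    The question itself is unchanged; the extra context shifts the
    model's output distribution, encouraging different rollouts
    without any increase in the sampling budget.
    """
    return rng.choice(PERTURBATIONS) + "\n" + prompt
```

Because the perturbation is prepended rather than woven into the query, the original task text is preserved verbatim at the end of the perturbed prompt.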
Entities
Institutions
- arXiv