FEST: Few-Shot Demonstration-Guided RLVR Boosts LLM Sample Efficiency
Researchers have introduced FEST (FEw-ShoT demonstration-guided RLVR), an algorithm that improves the sample efficiency of Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). The method uses only 128 demonstrations randomly sampled from a supervised fine-tuning (SFT) dataset, avoiding costly large-scale supervised fine-tuning. Its effectiveness rests on three components: a supervised signal, an on-policy signal, and decaying weights on the small SFT set to mitigate overfitting. On benchmarks, FEST outperforms existing baselines, offering a data-efficient approach for mathematical and coding tasks where accurate rollouts are scarce.
Key facts
- FEST is a few-shot demonstration-guided RLVR algorithm.
- It uses only 128 demonstrations randomly selected from an SFT dataset.
- Three components: supervised signal, on-policy signal, decaying weights.
- Decaying weights prevent overfitting during multi-epoch training on the small demonstration set.
- FEST outperforms baselines on several benchmarks.
- RLVR has been successful for math and coding tasks.
- Prior work falls back to SFT when RL fails, but SFT requires large-scale data.
- The paper is arXiv:2605.15012.
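The three-component objective described above can be sketched as a weighted combination of a supervised loss on the few demonstrations and an on-policy RLVR loss, with the supervised weight decaying over training. This is an illustrative sketch only: the exponential decay schedule, the coefficient `5.0`, and the function names are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of a FEST-style combined objective (not the paper's
# exact method). A supervised (SFT) loss on the 128 demonstrations is mixed
# with an on-policy RLVR loss; the SFT weight decays over training to limit
# overfitting to the small demonstration set.

import math

def sft_weight(step: int, total_steps: int, w0: float = 1.0) -> float:
    """Hypothetical exponential decay of the SFT-loss weight."""
    return w0 * math.exp(-5.0 * step / total_steps)

def combined_loss(sft_loss: float, rl_loss: float,
                  step: int, total_steps: int) -> float:
    """Weighted sum: decaying supervised term plus on-policy RLVR term."""
    return sft_weight(step, total_steps) * sft_loss + rl_loss

# Early in training the supervised term dominates; later the on-policy
# RLVR term takes over as the SFT weight decays toward zero.
early = combined_loss(sft_loss=2.0, rl_loss=0.5, step=0, total_steps=1000)
late = combined_loss(sft_loss=2.0, rl_loss=0.5, step=1000, total_steps=1000)
```

Any schedule that drives the SFT weight toward zero (linear, cosine, exponential) would serve the same purpose of preventing multi-epoch overfitting to the few demonstrations.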