Prefix-RFT: Hybrid LLM Post-Training Method
Prefix-RFT is a hybrid approach to large language model post-training that combines supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) to overcome their respective limitations. SFT excels at mimicking demonstration data but, as a form of behavior cloning, can generalize poorly. RFT enhances performance but is sensitive to the initial policy and prone to unexpected behaviors. Prefix-RFT combines learning from demonstration with learning from exploration, using mathematical reasoning problems as a test bed. The method outperforms standalone SFT, standalone RFT, and parallel mixed-policy RFT. The paper highlights the complementary nature of SFT and RFT and proposes a unified view of the two techniques.
Key facts
- Prefix-RFT is a hybrid approach combining SFT and RFT
- SFT excels at mimicking demonstration data but can lead to problematic generalization
- RFT enhances performance but is sensitive to the initial policy
- Prefix-RFT outperforms standalone SFT and RFT
- Prefix-RFT outperforms parallel mixed-policy RFT
- Mathematical reasoning problems were used as the test bed
- The approach is described as simple yet effective
- The paper proposes a unified view of SFT and RFT
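This summary does not describe how Prefix-RFT mixes demonstration and exploration. Going only by the method's name, one plausible reading is that a rollout starts from a prefix of a demonstration and the current policy generates the rest. The Python sketch below illustrates that reading and nothing more: `build_prefix_rollout`, `continue_fn`, and `prefix_ratio` are hypothetical names, and the paper's actual procedure may differ.

```python
from typing import Callable, List, Sequence, Tuple


def build_prefix_rollout(
    demo_tokens: Sequence[str],
    continue_fn: Callable[[List[str]], List[str]],
    prefix_ratio: float,
) -> Tuple[List[str], List[bool]]:
    """Combine a demonstration prefix with a policy-generated continuation.

    Assumption (not confirmed by the summary): the hybrid sample is the
    first `prefix_ratio` fraction of the demonstration followed by whatever
    the policy, represented here by `continue_fn`, generates after it.

    Returns the full token list and a parallel mask that is True for tokens
    copied from the demonstration and False for tokens the policy produced.
    """
    if not 0.0 <= prefix_ratio <= 1.0:
        raise ValueError("prefix_ratio must be in [0, 1]")

    # Number of demonstration tokens to keep as the fixed prefix.
    cut = int(len(demo_tokens) * prefix_ratio)
    prefix = list(demo_tokens[:cut])

    # The policy explores from the end of the prefix onward.
    continuation = continue_fn(prefix)

    tokens = prefix + continuation
    from_demo = [True] * len(prefix) + [False] * len(continuation)
    return tokens, from_demo


def toy_policy(prefix: List[str]) -> List[str]:
    """Stand-in for a real language model: always emits the same two tokens."""
    return ["<guess>", "<eos>"]


if __name__ == "__main__":
    demo = ["2", "+", "2", "="]
    tokens, from_demo = build_prefix_rollout(demo, toy_policy, prefix_ratio=0.5)
    print(tokens)
    print(from_demo)
```

In this sketch, `prefix_ratio=0.0` gives a rollout generated entirely by the policy, as in plain RFT, while larger values make more of the rollout follow the demonstration, which is closer to SFT. The mask lets a training loop treat demonstration tokens and policy tokens differently if it needs to.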