Branch-Merge Distillation Boosts TinyR1-32B-Preview Accuracy
Researchers have developed Branch-Merge distillation, a two-phase technique for compressing large language models while improving their accuracy. In the first phase, Branch, a large teacher model (DeepSeek-R1) distills knowledge into specialized student models through domain-specific supervised fine-tuning. In the second phase, Merge, these student models are combined, enabling cross-domain knowledge transfer and improving generalization. The resulting merged model, TinyR1-32B-Preview, outperforms DeepSeek-R1-Distill-Qwen-32B on multiple benchmarks. The approach addresses a key shortcoming of existing distillation and transfer-learning methods, which often fail to retain high accuracy after compression. The paper is available on arXiv as 2503.04872.
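The summary does not spell out how the Merge phase combines the domain-specialized students, so the sketch below only illustrates the general idea with simple element-wise parameter averaging; the function `merge_students`, the toy "student" dictionaries, and their names are hypothetical and are not the paper's actual merging procedure.

```python
# Minimal sketch of the Merge phase under a simplifying assumption:
# each domain-specialized student is a flat name -> parameter-vector mapping,
# and the merged model is an element-wise average of the students' parameters.
# All names here are illustrative placeholders, not from the paper.

from typing import Dict, List

Params = Dict[str, List[float]]


def merge_students(students: List[Params]) -> Params:
    """Average corresponding parameters across domain-specialized students."""
    assert students, "need at least one student model"
    merged: Params = {}
    for name in students[0]:
        vectors = [s[name] for s in students]  # the same tensor from every student
        merged[name] = [sum(vals) / len(vals) for vals in zip(*vectors)]
    return merged


if __name__ == "__main__":
    # Toy stand-ins for students fine-tuned on math, code, and science data.
    math_student = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.50]}
    code_student = {"layer0.weight": [3.0, 0.0], "layer0.bias": [0.25]}
    sci_student = {"layer0.weight": [2.0, 1.0], "layer0.bias": [0.75]}

    merged = merge_students([math_student, code_student, sci_student])
    print(merged)  # {'layer0.weight': [2.0, 1.0], 'layer0.bias': [0.5]}
```

In practice, merging 32B-parameter checkpoints would operate on full weight tensors and may weight or select parameters more carefully than a plain average; this toy version only shows where cross-domain knowledge transfer happens, namely when the specialized students' parameters are combined into one model.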
Key facts
- Branch-Merge distillation has two phases: Branch and Merge.
- DeepSeek-R1 is used as the teacher model.
- DeepSeek-R1-Distill-Qwen-32B is the 32B baseline the merged model is compared against.
- TinyR1-32B-Preview is the resulting merged model.
- TinyR1-32B-Preview outperforms DeepSeek-R1-Distill-Qwen-32B.
- The method improves model compression while maintaining performance.
- Existing distillation and transfer-learning methods often fail to retain high accuracy after compression.
- The paper is arXiv:2503.04872.
Entities
Institutions
- arXiv