Multi-Teacher Bayesian Knowledge Distillation for LLM Compression
Multi-Teacher Bayesian Knowledge Distillation (MT-BKD) is a new method for compressing large language models. It uses Bayesian inference to capture uncertainty in the distillation process and introduces a teacher-informed prior that integrates external knowledge from multiple teacher models with task-specific training data. An entropy-based weighting mechanism adaptively adjusts each teacher's influence on the student. The method aims to improve the generalization, robustness, and scalability of model compression.
Key facts
- Method is called Multi-Teacher Bayesian Knowledge Distillation (MT-BKD)
- Uses Bayesian inference to capture uncertainty
- Introduces a teacher-informed prior integrating external knowledge
- Employs entropy-based weighting for teacher influence
- Aims to improve generalization, robustness, and scalability
- Targets real-world scenarios in which teachers have diverse expertise
- Motivating gap: the statistical mechanisms underlying knowledge distillation remain unclear
- Motivating gap: current distillation methods often overlook uncertainty evaluation
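
The summary does not give the exact form of the entropy-based weighting, so the following is only a minimal sketch of one plausible reading: each teacher's predictive entropy is computed, and teachers with lower entropy (more confident predictions) receive a larger weight through a softmax over negative entropies. The function names, the softmax-over-negative-entropy rule, and the `temperature` parameter are assumptions made for illustration, not the MT-BKD authors' formulation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def entropy(probs, axis=-1, eps=1e-12):
    """Shannon entropy (in nats) of each probability vector."""
    return -np.sum(probs * np.log(probs + eps), axis=axis)


def entropy_weighted_targets(teacher_logits, temperature=1.0):
    """Combine teacher predictions for one input (hypothetical helper).

    teacher_logits: array of shape (num_teachers, num_classes).
    Returns (weights, target): per-teacher weights and the blended
    soft target distribution a student could be trained against.
    """
    logits = np.asarray(teacher_logits, dtype=float) / temperature
    probs = softmax(logits, axis=-1)      # (num_teachers, num_classes)
    h = entropy(probs, axis=-1)           # (num_teachers,)
    # Assumed rule: lower entropy -> higher weight.
    weights = softmax(-h, axis=0)
    target = np.sum(weights[:, None] * probs, axis=0)
    return weights, target


# Illustrative logits: teacher 0 is confident, teacher 1 is nearly uniform.
teacher_logits = np.array([
    [4.0, 0.5, 0.1],
    [1.0, 0.9, 0.8],
])
weights, target = entropy_weighted_targets(teacher_logits)
print("teacher weights:", weights)
print("blended soft target:", target)
```

In this sketch the confident teacher dominates the blended target while the uncertain teacher still contributes. A full Bayesian treatment, as the summary describes, would additionally place a teacher-informed prior on the student's parameters; that part is not shown here.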