BEAM: Binary Expert Activation Masking for Efficient MoE
Researchers propose BEAM (Binary Expert Activation Masking), a method to improve Mixture-of-Experts (MoE) efficiency in large language models. Standard MoE layers use fixed Top-K routing, which activates the same number of experts for every token and causes redundant computation. BEAM instead learns token-adaptive expert selection via trainable binary masks, trained end to end with a straight-through estimator and an auxiliary regularization loss. An efficient custom CUDA kernel integrates the method with the vLLM inference framework. Experiments show BEAM retains model performance while reducing inference latency.
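To make the binary-mask idea concrete, here is a minimal sketch of token-adaptive expert masking with a straight-through estimator. This is not the authors' code; the module name `BinaryExpertMask` and the 0.5 binarization threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BinaryExpertMask(nn.Module):
    """Per-token 0/1 mask over experts, in place of a fixed Top-K selection."""

    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        probs = torch.sigmoid(self.gate(x))   # soft activation scores per expert
        hard = (probs > 0.5).float()          # binarize per token and expert
        # Straight-through estimator: the forward pass uses the hard 0/1 mask,
        # while gradients flow through the soft probabilities in the backward pass.
        mask = hard + probs - probs.detach()
        return mask, probs
```

Under such a scheme, tokens whose mask activates few experts skip the remaining experts' computation entirely, which is where the reported latency savings would come from.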
Key facts
- BEAM stands for Binary Expert Activation Masking.
- The method addresses the inefficiency of fixed Top-K routing in MoE.
- Uses trainable binary masks for token-adaptive expert selection.
- A straight-through estimator and an auxiliary regularization loss enable end-to-end training (see the regularizer sketch after this list).
- A custom CUDA kernel is implemented for the vLLM inference framework.
- Aims to reduce redundant computation and inference latency.
- Published on arXiv with ID 2605.14438.
- Experiments show performance retention at high sparsity.
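The summary does not give the exact form of the auxiliary regularization loss; below is a minimal sketch of one plausible variant that penalizes the expected number of active experts per token so the learned masks stay sparse. The L1-style penalty and the `target_experts` budget are assumptions, not the paper's stated loss.

```python
import torch


def sparsity_regularizer(probs: torch.Tensor, target_experts: float = 2.0) -> torch.Tensor:
    # probs: (num_tokens, num_experts) soft gate probabilities from the mask module.
    expected_active = probs.sum(dim=-1)  # expected number of active experts per token
    # Penalize tokens whose expected expert count exceeds the target budget.
    return torch.relu(expected_active - target_experts).mean()
```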
Entities
Institutions
- arXiv