BEAM: Binary Expert Activation Masking for Efficient MoE
Researchers propose BEAM (Binary Expert Activation Masking), a method to improve Mixture-of-Experts (MoE) efficiency in large language models. Standard MoE layers use fixed Top-K routing, which activates the same number of experts for every token and causes redundant computation. BEAM instead learns token-adaptive expert selection via trainable binary masks, trained end to end with a straight-through estimator and an auxiliary regularization loss. An efficient custom CUDA kernel integrates the method with the vLLM inference framework. Experiments show BEAM retains model performance while reducing inference latency.
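To make the binary-mask idea concrete, here is a minimal sketch of token-adaptive expert masking with a straight-through estimator. This is not the authors' code; the module name `BinaryExpertMask` and the 0.5 binarization threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BinaryExpertMask(nn.Module):
    """Per-token 0/1 mask over experts, in place of a fixed Top-K selection."""

    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        probs = torch.sigmoid(self.gate(x))   # soft activation scores per expert
        hard = (probs > 0.5).float()          # binarize per token and expert
        # Straight-through estimator: the forward pass uses the hard 0/1 mask,
        # while gradients flow through the soft probabilities in the backward pass.
        mask = hard + probs - probs.detach()
        return mask, probs
```

Under such a scheme, tokens whose mask activates few experts skip the remaining experts' computation entirely, which is where the reported latency savings would come from.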
Key facts
- BEAM stands for Binary Expert Activation Masking.
- The method addresses the inefficiency of fixed Top-K routing in MoE.
- Uses trainable binary masks for token-adaptive expert selection.
- A straight-through estimator and an auxiliary regularization loss enable end-to-end training (see the regularizer sketch after this list).
- A custom CUDA kernel is implemented for the vLLM inference framework.
- Aims to reduce redundant computation and inference latency.
- Published on arXiv with ID 2605.14438.
- Experiments show performance retention at high sparsity.
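The summary does not give the exact form of the auxiliary regularization loss; below is a minimal sketch of one plausible variant that penalizes the expected number of active experts per token so the learned masks stay sparse. The L1-style penalty and the `target_experts` budget are assumptions, not the paper's stated loss.

```python
import torch


def sparsity_regularizer(probs: torch.Tensor, target_experts: float = 2.0) -> torch.Tensor:
    # probs: (num_tokens, num_experts) soft gate probabilities from the mask module.
    expected_active = probs.sum(dim=-1)  # expected number of active experts per token
    # Penalize tokens whose expected expert count exceeds the target budget.
    return torch.relu(expected_active - target_experts).mean()
```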
Entities
Institutions
- arXiv