ARTFEED — Contemporary Art Intelligence

BEAM: Binary Expert Activation Masking for Efficient MoE

ai-technology · 2026-05-16

Researchers propose BEAM (Binary Expert Activation Masking), a method to improve Mixture-of-Experts (MoE) efficiency in large language models. Standard MoE layers use fixed Top-K routing, which activates the same number of experts for every token and causes redundant computation. BEAM instead learns token-adaptive expert selection via trainable binary masks, trained end-to-end with a straight-through estimator and an auxiliary regularization loss. An efficient custom CUDA kernel integrates the method with the vLLM inference framework. Experiments show BEAM retains model performance while reducing inference latency.
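
The core mechanism can be illustrated with a minimal sketch. The module below is a hypothetical reconstruction based only on this summary, not the paper's code: a per-token gate produces sigmoid activations, a threshold yields a hard binary mask, the straight-through estimator lets gradients flow through the soft activations, and an auxiliary term penalizes the expected number of active experts. The module name, threshold, and loss weight are assumptions.

# Hypothetical sketch of token-adaptive binary expert masking with a
# straight-through estimator; details are assumed, not taken from the paper.
import torch
import torch.nn as nn

class BinaryExpertMask(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, sparsity_weight: float = 0.01):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)  # per-token gating logits
        self.sparsity_weight = sparsity_weight           # weight of the auxiliary loss (assumed value)

    def forward(self, x: torch.Tensor):
        # x: [tokens, hidden_dim]
        logits = self.gate(x)
        probs = torch.sigmoid(logits)        # soft per-expert activation probabilities
        hard = (probs > 0.5).float()         # binary mask, not differentiable on its own
        # Straight-through estimator: the forward pass uses the hard mask,
        # the backward pass routes gradients through the soft probabilities.
        mask = hard + probs - probs.detach()
        # Auxiliary regularization: penalize the expected number of active
        # experts so the mask is pushed toward sparsity.
        aux_loss = self.sparsity_weight * probs.mean()
        return mask, aux_loss

At inference time, only the experts whose mask entry is 1 would be executed for a given token, which is where the reported latency savings would come from.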

Key facts

  • BEAM stands for Binary Expert Activation Masking.
  • Addresses the inefficiency of fixed Top-K routing in MoE models.
  • Uses trainable binary masks for token-adaptive expert selection.
  • Straight-through estimator and auxiliary regularization loss enable end-to-end training.
  • Custom CUDA kernel implemented for the vLLM inference framework (a mask-gated dispatch sketch follows this list).
  • Aims to reduce redundant computation and inference latency.
  • Published on arXiv with ID 2605.14438.
  • Experiments show performance retention at high sparsity.
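
To make the efficiency claim concrete, the sketch below shows the kind of mask-gated dispatch a fused kernel would accelerate: each expert is applied only to the tokens whose mask selects it, so inactive experts contribute no computation. This is an illustrative reconstruction, not the paper's kernel; the function name, shapes, and combination rule are assumptions.

# Hypothetical mask-gated MoE forward pass, written as a pure PyTorch
# reference for what a fused CUDA kernel would accelerate.
import torch

def masked_moe_forward(x: torch.Tensor, mask: torch.Tensor, experts) -> torch.Tensor:
    # x:       [tokens, hidden_dim]   token representations
    # mask:    [tokens, num_experts]  binary expert activations (0 or 1)
    # experts: list of num_experts modules, each mapping hidden_dim -> hidden_dim
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        idx = mask[:, e].nonzero(as_tuple=True)[0]  # tokens routed to expert e
        if idx.numel() == 0:
            continue                                # inactive expert costs nothing
        out[idx] += expert(x[idx])                  # compute only selected tokens
    return out

The summary does not specify how active expert outputs are combined; summation is used here purely for illustration, and weighting by gate probabilities would be an equally plausible variant.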

Entities

Institutions

  • arXiv

Sources