MACS Framework Enhances Multimodal MoE Inference Efficiency
Researchers propose MACS (Modality-Aware Capacity Scaling), a training-free inference framework that targets efficiency bottlenecks in Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) under Expert Parallelism (EP) inference. The straggler effect is aggravated in multimodal settings by two factors: information heterogeneity, where redundant visual tokens are weighted the same as critical ones, and modality dynamics, where varying visual-to-text ratios lead to resource misallocation. MACS introduces an Entropy-Weighted Load mechanism that quantifies the semantic value of visual tokens and a Dynamic Modality-Adaptive Capacity mechanism that allocates expert resources according to the real-time modal composition. The framework is detailed in arXiv:2605.05225.
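The abstract does not give a formula for the Entropy-Weighted Load mechanism. As an illustration only, one plausible reading is to weight each visual token's contribution to expert load by the entropy of its router distribution; the function name, the direction of the weighting (confident routing treated as more semantically valuable), and the normalization below are all assumptions, not the paper's definition.

```python
import numpy as np

def entropy_weighted_load(router_probs: np.ndarray) -> np.ndarray:
    """Hypothetical entropy-based token weighting (not the paper's formula).

    router_probs: (num_tokens, num_experts) softmax outputs of the MoE router.
    Returns a per-token weight in [0, 1], where a uniform (maximally
    uncertain) routing distribution gets weight 0 and a near-one-hot
    (confident) distribution gets weight close to 1.
    """
    eps = 1e-9  # avoid log(0)
    entropy = -np.sum(router_probs * np.log(router_probs + eps), axis=-1)
    max_entropy = np.log(router_probs.shape[-1])  # entropy of uniform dist.
    return 1.0 - entropy / max_entropy

# Example: a redundant token (uniform routing) vs. a decisive one.
probs = np.array([
    [0.25, 0.25, 0.25, 0.25],   # uniform -> weight ~0
    [0.97, 0.01, 0.01, 0.01],   # confident -> weight near 1
])
weights = entropy_weighted_load(probs)
```

Weights like these could then scale each token's load when balancing experts, so that redundant visual tokens count for less than critical ones.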
Key facts
- MACS is a training-free inference framework
- Addresses efficiency bottleneck in MoE MLLMs during EP inference
- Two challenges: Information Heterogeneity and Modality Dynamics
- Entropy-Weighted Load mechanism quantifies semantic value of visual tokens
- Dynamic Modality-Adaptive Capacity mechanism allocates expert resources based on real-time modal composition
- Published on arXiv with ID 2605.05225
- Announce type: cross
- Proposed by researchers (authors not specified in abstract)
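The facts above mention that the Dynamic Modality-Adaptive Capacity mechanism allocates expert resources from the real-time modal composition. The abstract gives no allocation rule, so the following is a minimal sketch under an assumed proportional policy: per-expert capacity is split between visual and text tokens according to the current batch's visual-to-text ratio, with a floor share so neither modality is starved. All names and the `min_share` parameter are hypothetical.

```python
def modality_adaptive_capacity(num_visual: int, num_text: int,
                               total_capacity: int,
                               min_share: float = 0.1) -> dict:
    """Split per-expert token capacity between modalities in proportion
    to the batch's real-time composition (illustrative policy only; the
    actual MACS rule is not specified in the abstract)."""
    total = num_visual + num_text
    if total == 0:
        return {"visual": 0, "text": 0}
    # Proportional share, clipped so each modality keeps a minimum slice.
    visual_share = max(min_share, min(1.0 - min_share, num_visual / total))
    visual_cap = round(total_capacity * visual_share)
    return {"visual": visual_cap, "text": total_capacity - visual_cap}

# A vision-heavy batch (300 visual vs. 100 text tokens) shifts capacity
# toward the visual modality instead of a fixed 50/50 split.
caps = modality_adaptive_capacity(num_visual=300, num_text=100,
                                  total_capacity=64)
```

A static split would misallocate capacity whenever the visual-to-text ratio drifts between requests, which is exactly the modality-dynamics problem the framework targets.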
Entities
Institutions
- arXiv