MACS Framework Enhances Multimodal MoE Inference Efficiency
Researchers propose MACS (Modality-Aware Capacity Scaling), a training-free inference framework that targets efficiency bottlenecks in Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) under Expert Parallelism (EP) inference. The straggler effect is aggravated in multimodal settings by two factors: information heterogeneity, where redundant visual tokens are weighted the same as critical ones, and modality dynamics, where varying visual-to-text ratios lead to resource misallocation. MACS introduces an Entropy-Weighted Load mechanism that quantifies the semantic value of visual tokens and a Dynamic Modality-Adaptive Capacity mechanism that allocates expert resources according to the real-time modal composition. The framework is detailed in arXiv:2605.05225.
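The abstract does not give a formula for the Entropy-Weighted Load mechanism. As an illustration only, one plausible reading is to weight each visual token's contribution to expert load by the entropy of its router distribution; the function name, the direction of the weighting (confident routing treated as more semantically valuable), and the normalization below are all assumptions, not the paper's definition.

```python
import numpy as np

def entropy_weighted_load(router_probs: np.ndarray) -> np.ndarray:
    """Hypothetical entropy-based token weighting (not the paper's formula).

    router_probs: (num_tokens, num_experts) softmax outputs of the MoE router.
    Returns a per-token weight in [0, 1], where a uniform (maximally
    uncertain) routing distribution gets weight 0 and a near-one-hot
    (confident) distribution gets weight close to 1.
    """
    eps = 1e-9  # avoid log(0)
    entropy = -np.sum(router_probs * np.log(router_probs + eps), axis=-1)
    max_entropy = np.log(router_probs.shape[-1])  # entropy of uniform dist.
    return 1.0 - entropy / max_entropy

# Example: a redundant token (uniform routing) vs. a decisive one.
probs = np.array([
    [0.25, 0.25, 0.25, 0.25],   # uniform -> weight ~0
    [0.97, 0.01, 0.01, 0.01],   # confident -> weight near 1
])
weights = entropy_weighted_load(probs)
```

Weights like these could then scale each token's load when balancing experts, so that redundant visual tokens count for less than critical ones.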
Key facts
- MACS is a training-free inference framework
- Addresses efficiency bottleneck in MoE MLLMs during EP inference
- Two challenges: Information Heterogeneity and Modality Dynamics
- Entropy-Weighted Load mechanism quantifies semantic value of visual tokens
- Dynamic Modality-Adaptive Capacity mechanism allocates expert resources based on real-time modal composition
- Published on arXiv with ID 2605.05225
- Announce type: cross
- Proposed by researchers (authors not specified in abstract)
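The facts above mention that the Dynamic Modality-Adaptive Capacity mechanism allocates expert resources from the real-time modal composition. The abstract gives no allocation rule, so the following is a minimal sketch under an assumed proportional policy: per-expert capacity is split between visual and text tokens according to the current batch's visual-to-text ratio, with a floor share so neither modality is starved. All names and the `min_share` parameter are hypothetical.

```python
def modality_adaptive_capacity(num_visual: int, num_text: int,
                               total_capacity: int,
                               min_share: float = 0.1) -> dict:
    """Split per-expert token capacity between modalities in proportion
    to the batch's real-time composition (illustrative policy only; the
    actual MACS rule is not specified in the abstract)."""
    total = num_visual + num_text
    if total == 0:
        return {"visual": 0, "text": 0}
    # Proportional share, clipped so each modality keeps a minimum slice.
    visual_share = max(min_share, min(1.0 - min_share, num_visual / total))
    visual_cap = round(total_capacity * visual_share)
    return {"visual": visual_cap, "text": total_capacity - visual_cap}

# A vision-heavy batch (300 visual vs. 100 text tokens) shifts capacity
# toward the visual modality instead of a fixed 50/50 split.
caps = modality_adaptive_capacity(num_visual=300, num_text=100,
                                  total_capacity=64)
```

A static split would misallocate capacity whenever the visual-to-text ratio drifts between requests, which is exactly the modality-dynamics problem the framework targets.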
Entities
Institutions
- arXiv