ReMoE: Router Fine-Tuning Boosts Expert Reuse in Memory-Constrained MoE LLMs
ReMoE is a router fine-tuning framework that improves expert reuse in Mixture-of-Experts (MoE) large language models during memory-constrained inference. Fine-grained MoE models activate only a small number of experts per token, which reduces computation but forces frequent reads from slow external storage whenever a required expert is not cached. ReMoE biases the router toward recently selected experts, which yields temporally stable routing that matches cache locality and reduces expert fetches without adding inference-time computation. Experiments on DeepSeek and Qwen models show a 26% improvement in expert reuse while preserving downstream task performance. Real-system evaluations confirm these gains.
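To make "expert reuse" concrete, the sketch below computes one plausible token-wise reuse rate: the fraction of expert selections at each token that repeat an expert chosen for the previous token. This definition, the function name `expert_reuse_rate`, and the toy routing traces are assumptions for illustration; the summary does not state the exact metric ReMoE reports.

```python
def expert_reuse_rate(trace):
    """Fraction of expert selections that repeat an expert used by the previous token.

    trace: list of per-token expert-ID tuples for a single MoE layer.
    This is an assumed definition of reuse, not necessarily the one ReMoE uses.
    """
    reused = 0
    total = 0
    for prev, curr in zip(trace, trace[1:]):
        prev_set = set(prev)
        # Count experts selected now that were also selected one token earlier.
        reused += sum(1 for expert in curr if expert in prev_set)
        total += len(curr)
    return reused / total if total else 0.0


# Hypothetical top-2 routing traces over four tokens.
unstable_trace = [(0, 1), (2, 3), (4, 5), (6, 7)]  # no expert repeats
stable_trace = [(0, 1), (0, 2), (0, 2), (2, 3)]    # experts persist across tokens

print(expert_reuse_rate(unstable_trace))
print(expert_reuse_rate(stable_trace))
```

On these toy traces the unstable routing reuses no experts, while the stable routing reuses four of its six selections.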
Key facts
- ReMoE is a router fine-tuning framework for MoE LLMs.
- It boosts token-wise expert reuse in memory-constrained scenarios.
- Only a small set of experts can be cached; the rest must be fetched from slow UFS (Universal Flash Storage).
- ReMoE biases the router toward recently selected experts.
- It produces temporally stable routing that matches cache locality.
- Experiments on DeepSeek and Qwen models show 26% improvement in expert reuse.
- Downstream task performance is maintained.
- Real-system evaluations confirm the benefits.
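The link between stable routing and fewer fetches can be illustrated with a small cache simulation. The sketch below counts how many experts must be loaded from slow storage when a fixed-size cache holds recently used experts. The LRU eviction policy, the function name `count_expert_fetches`, the cache size, and the traces are all assumptions for illustration; the summary does not describe the caching policy used in the real-system evaluation.

```python
from collections import OrderedDict


def count_expert_fetches(trace, cache_size):
    """Count cache misses (loads from slow storage) for a routing trace.

    Assumes an LRU expert cache; the actual policy is not given in the source.
    """
    cache = OrderedDict()  # expert ID -> None, ordered from least to most recent
    fetches = 0
    for experts in trace:
        for expert in experts:
            if expert in cache:
                cache.move_to_end(expert)  # hit: mark as most recently used
            else:
                fetches += 1  # miss: expert must be loaded from storage
                cache[expert] = None
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict the least recently used
    return fetches


# Hypothetical top-2 routing traces over four tokens, with room for 4 cached experts.
scattered_routing = [(0, 1), (2, 3), (4, 5), (0, 1)]
recency_biased_routing = [(0, 1), (0, 1), (0, 2), (0, 2)]

print(count_expert_fetches(scattered_routing, cache_size=4))
print(count_expert_fetches(recency_biased_routing, cache_size=4))
```

Both traces perform the same number of expert activations per token, so the compute is identical; only the number of storage fetches differs (8 versus 3 here).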