ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
ZipMoE is a system for serving Mixture-of-Experts (MoE) large language models on edge devices without lossy quantization. It uses a caching-scheduling co-design to shift inference from an I/O-bound workflow to a compute-centric one, which in turn enables efficient parallelization. The authors evaluate a prototype on representative edge platforms with open-source MoE models and report that the approach is effective.
Key facts
- ZipMoE is a semantically lossless on-device MoE serving system.
- It exploits synergy between edge device hardware and statistical redundancy in MoE parameters.
- The design shifts inference from an I/O-bound to a compute-centric workflow.
- A prototype was implemented and tested on representative edge computing platforms.
- Experiments used popular open-source MoE models.
- The system provides provable performance guarantees.
- It avoids lossy quantization to preserve model behavior.
- ZipMoE enables efficient parallelization on edge devices.
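To make "lossless" and "statistical redundancy" concrete, the sketch below compresses a block of float32 weights and checks that decompression is bit-exact. It is not ZipMoE's codec, which the summary does not describe. The byte-plane split, the use of zlib, the synthetic Gaussian weights, and all function names are assumptions chosen only for illustration.

```python
import random
import struct
import zlib


def make_weights(n, seed=0):
    """Synthetic stand-in for expert weights (assumption: small Gaussian values)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]


def to_float32_bytes(values):
    """Pack Python floats as little-endian float32."""
    return struct.pack(f"<{len(values)}f", *values)


def split_planes(raw):
    """Group the k-th byte of every float32 together (4 byte planes)."""
    return [raw[i::4] for i in range(4)]


def merge_planes(planes):
    """Inverse of split_planes: interleave the 4 planes back into float32 bytes."""
    out = bytearray(len(planes[0]) * 4)
    for i, plane in enumerate(planes):
        out[i::4] = plane
    return bytes(out)


def compress(raw):
    """Compress each byte plane independently with zlib (a generic stand-in codec)."""
    return [zlib.compress(plane, 9) for plane in split_planes(raw)]


def decompress(blobs):
    """Recover the exact original bytes from the compressed planes."""
    return merge_planes([zlib.decompress(blob) for blob in blobs])


if __name__ == "__main__":
    raw = to_float32_bytes(make_weights(4096))
    blobs = compress(raw)
    assert decompress(blobs) == raw  # bit-exact round trip, so model behavior is unchanged
    for i, blob in enumerate(blobs):
        print(f"plane {i}: {len(raw) // 4} -> {len(blob)} bytes")
```

In this toy setting, plane 3 (the sign bit plus the upper exponent bits) takes only a handful of distinct values because the weights share a narrow magnitude range, so it shrinks noticeably, while the low mantissa planes look close to random and barely compress. Real expert weights and the actual ZipMoE format may behave differently.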