ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

ai-technology · 2026-05-25

ZipMoE is a system for serving Mixture-of-Experts (MoE) large language models on edge devices without lossy quantization, relying on lossless compression instead. Its caching-scheduling co-design shifts inference from an I/O-bound workflow to a compute-centric one, which enables efficient parallelization. A prototype evaluated on representative edge platforms with popular open-source MoE models demonstrates its effectiveness.

Key facts

ZipMoE is a semantically lossless on-device MoE serving system.
It exploits the synergy between edge-device hardware and the statistical redundancy in MoE parameters.
The design shifts inference from an I/O-bound workflow to a compute-centric one.
A prototype was implemented and tested on representative edge computing platforms.
Experiments used popular open-source MoE models.
The system provides provable performance guarantees.
It avoids lossy quantization to preserve model behavior.
ZipMoE enables efficient parallelization on edge devices.
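
The idea of losslessly exploiting statistical redundancy in parameters can be illustrated with a small sketch. It is not ZipMoE's actual codec, which this summary does not describe. It assumes float32 weights drawn from a narrow Gaussian (synthetic data), and the helper names make_weights, compress_planes, and decompress_planes are hypothetical. The sketch splits each weight into its four bytes and compresses each byte plane separately with zlib. The plane holding the sign and most of the exponent tends to compress well because trained-style weights occupy a narrow range of magnitudes, while the low mantissa bytes are close to random.

```python
import random
import struct
import zlib


def make_weights(n, seed=0):
    """Synthetic stand-in for expert weights: n little-endian float32 values (assumption)."""
    rng = random.Random(seed)
    return struct.pack(f"<{n}f", *(rng.gauss(0.0, 0.02) for _ in range(n)))


def compress_planes(raw):
    """Split float32 bytes into 4 byte planes and zlib-compress each plane."""
    if len(raw) % 4 != 0:
        raise ValueError("expected a whole number of float32 values")
    # Plane 3 holds the sign bit and the top 7 exponent bits of every value.
    return [zlib.compress(raw[i::4], 9) for i in range(4)]


def decompress_planes(blobs):
    """Inverse of compress_planes: restores the original bytes exactly."""
    planes = [zlib.decompress(b) for b in blobs]
    n = len(planes[0])
    raw = bytearray(4 * n)
    for i, plane in enumerate(planes):
        raw[i::4] = plane  # re-interleave the byte planes
    return bytes(raw)


if __name__ == "__main__":
    raw = make_weights(4096)
    blobs = compress_planes(raw)
    assert decompress_planes(blobs) == raw  # bit-exact round trip
    for i, blob in enumerate(blobs):
        print(f"plane {i}: {len(raw) // 4} bytes -> {len(blob)} bytes")
```

Because the round trip is bit-exact, model outputs are unchanged, which is the property that separates this family of techniques from lossy quantization. The cost is that decompression adds compute before an expert can run.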

Entities

—

Sources

arXiv cs.AI — 2026-05-25