ARTFEED — Contemporary Art Intelligence

ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

ai-technology · 2026-05-25

ZipMoE is a system for serving Mixture-of-Experts (MoE) large language models on edge devices using lossless compression rather than lossy quantization. Its caching-scheduling co-design shifts inference from an I/O-bound workflow to a compute-centric one, which in turn allows efficient parallelization. Experiments on representative edge platforms with popular open-source MoE models demonstrate its effectiveness.

Key facts

  • ZipMoE is a semantically lossless on-device MoE serving system.
  • It exploits synergy between edge device hardware and statistical redundancy in MoE parameters.
  • The design shifts inference from an I/O-bound workflow to a compute-centric one.
  • A prototype was implemented and tested on representative edge computing platforms.
  • Experiments used popular open-source MoE models.
  • The system provides provable performance guarantees.
  • It avoids lossy quantization to preserve model behavior.
  • ZipMoE enables efficient parallelization on edge devices.
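
To make the idea of lossless compression over redundant parameters concrete, here is a minimal sketch. It is not ZipMoE's codec, which this summary does not describe. It assumes float16 weights, splits them into two byte planes, and uses zlib as a stand-in compressor. The names compress_expert, decompress_expert, and make_fake_expert are hypothetical, and the weights are synthetic.

```python
import random
import struct
import zlib


def compress_expert(raw: bytes) -> tuple[bytes, bytes]:
    """Split little-endian float16 bytes into two byte planes and deflate each.

    Illustrative only: byte-plane splitting plus zlib is an assumed stand-in,
    not the scheme used by ZipMoE.
    """
    low_plane = raw[0::2]   # low mantissa bits, usually close to random
    high_plane = raw[1::2]  # sign, exponent, top mantissa bits: more redundant
    return zlib.compress(low_plane, 9), zlib.compress(high_plane, 9)


def decompress_expert(low: bytes, high: bytes) -> bytes:
    """Invert compress_expert and return the original bytes, bit for bit."""
    low_plane = zlib.decompress(low)
    high_plane = zlib.decompress(high)
    raw = bytearray(len(low_plane) + len(high_plane))
    raw[0::2] = low_plane   # re-interleave the two planes
    raw[1::2] = high_plane
    return bytes(raw)


def make_fake_expert(n: int, seed: int = 0) -> bytes:
    """Build n synthetic float16 weights (hypothetical data, not a real model)."""
    rng = random.Random(seed)
    values = [rng.gauss(0.0, 0.02) for _ in range(n)]
    return struct.pack(f"<{n}e", *values)


if __name__ == "__main__":
    raw = make_fake_expert(4096)
    low, high = compress_expert(raw)
    restored = decompress_expert(low, high)
    print("bit-exact round trip:", restored == raw)
    print("original bytes:", len(raw), "compressed bytes:", len(low) + len(high))
```

Because the round trip restores every byte, the decompressed expert produces the same outputs as the original, which is the property that distinguishes this approach from lossy quantization. Decompression costs compute, so a scheme of this kind trades storage I/O for extra work on the processor.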
