Dense2MoE: Unified Pruning and Upcycling for Efficient On-Device LLMs

ai-technology · 2026-05-27

Researchers propose Dense2MoE, a framework that combines pruning and upcycling to create efficient Mixture of Experts (MoE) models for on-device deployment. Its method, Layer Fusion UpCycling (LF-UC), prunes the bandwidth-heavy attention modules from redundant layers and repurposes the MLPs of those layers as MoE experts. This preserves core model capabilities, while selective token routing limits the number of active parameters. The design is guided by hardware Roofline theory, with the goal of overcoming the inference memory wall. The approach addresses the trade-off between parameter redundancy and model accuracy, achieving a better Pareto frontier for on-device LLMs.
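The Roofline reasoning mentioned above can be illustrated with a small calculation. This is a minimal sketch of the general Roofline model, not the paper's analysis: the device figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`) and the function names are hypothetical, chosen only to show why low-intensity, single-token decoding tends to be limited by memory bandwidth rather than compute.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline bound: the lower of the compute roof and the bandwidth roof.

    intensity is arithmetic intensity in FLOPs per byte moved from memory.
    """
    return min(peak_flops, mem_bandwidth * intensity)


def is_memory_bound(peak_flops, mem_bandwidth, intensity):
    """A kernel is memory-bound when its intensity is below the ridge point."""
    ridge_point = peak_flops / mem_bandwidth
    return intensity < ridge_point


# Hypothetical on-device accelerator (illustrative numbers, not from the source).
PEAK_FLOPS = 2.0e12      # 2 TFLOP/s
MEM_BANDWIDTH = 5.0e10   # 50 GB/s

# Assumption: a matrix-vector product over fp16 weights performs about
# 2 FLOPs per weight and reads 2 bytes per weight, so roughly 1 FLOP/byte.
DECODE_INTENSITY = 1.0

if __name__ == "__main__":
    bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, DECODE_INTENSITY)
    print("attainable FLOP/s:", bound)
    print("memory-bound:", is_memory_bound(PEAK_FLOPS, MEM_BANDWIDTH, DECODE_INTENSITY))
```

Under these assumed numbers the ridge point sits at 40 FLOPs per byte, so a workload at about 1 FLOP per byte is far into the memory-bound region. That is the regime in which reducing the bytes read per token matters more than reducing FLOPs.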

Key facts

Dense2MoE unifies pruning and upcycling for on-device LLMs
Layer Fusion UpCycling (LF-UC) prunes attention modules and repurposes MLPs as MoE experts
Guided by hardware Roofline theory to overcome the inference memory wall
Selective token routing limits active parameters
Aims to improve the Pareto frontier for on-device LLM efficiency
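
Selective token routing can be sketched as follows. The source does not describe the router, so this assumes a common MoE choice, softmax gating with top-k selection; `route_token`, `moe_forward`, and the toy experts are hypothetical stand-ins for the MLPs that LF-UC repurposes, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(token, experts, router_logits, k=1):
    """Run only the selected experts and combine their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, k):
        expert_out = experts[idx](token)  # unselected experts are never evaluated
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Toy experts standing in for MLPs taken from pruned layers (assumption).
def expert_a(x):
    return [2.0 * v for v in x]


def expert_b(x):
    return [v + 1.0 for v in x]


def expert_c(x):
    return [-v for v in x]


EXPERTS = [expert_a, expert_b, expert_c]

if __name__ == "__main__":
    logits = [0.1, 2.0, -1.0]  # hypothetical router scores for one token
    print(moe_forward([1.0, 2.0], EXPERTS, logits, k=1))
```

With `k=1` only one of the three experts runs per token, so the active parameter count stays close to that of a single MLP even though the total parameter count covers all experts.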

Entities

—

Sources

arXiv cs.AI — 2026-05-27