ARTFEED — Contemporary Art Intelligence

Dense2MoE: Unified Pruning and Upcycling for Efficient On-Device LLMs

ai-technology · 2026-05-27

Researchers propose Dense2MoE, a framework that combines pruning and upcycling to produce efficient Mixture of Experts (MoE) models for on-device deployment. Its method, Layer Fusion UpCycling (LF-UC), prunes the bandwidth-heavy attention modules from redundant layers and repurposes those layers' MLPs as MoE experts. This preserves core model capabilities, while selective token routing limits the number of active parameters. The design is guided by hardware Roofline theory and aims to overcome the inference memory wall. The approach addresses the trade-off between parameter redundancy and model accuracy, achieving a better Pareto frontier for on-device LLMs.
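For readers unfamiliar with the Roofline model mentioned above, the short sketch below shows the standard bound it is built on: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The device numbers and function names are hypothetical and chosen only for illustration. They are not taken from the Dense2MoE work.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Standard Roofline bound: min(peak compute, bandwidth * intensity)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being memory-bound."""
    return peak_flops / mem_bandwidth


# Hypothetical on-device accelerator (assumed values, not measured hardware).
PEAK_FLOPS = 1e13      # 10 TFLOP/s peak compute
MEM_BANDWIDTH = 5e10   # 50 GB/s memory bandwidth

ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)

# A low-intensity kernel sits under the sloped (memory-bound) part of the roof.
memory_bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, 2.0)

# A high-intensity kernel is capped by the flat (compute-bound) part of the roof.
compute_bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, 400.0)

print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"memory-bound kernel: {memory_bound:.2e} FLOP/s")
print(f"compute-bound kernel: {compute_bound:.2e} FLOP/s")
```

Under these assumed numbers, any operation with an intensity below the ridge point is limited by memory traffic rather than compute, which is the regime the "memory wall" refers to.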

Key facts

  • Dense2MoE unifies pruning and upcycling for on-device LLMs
  • Layer Fusion UpCycling (LF-UC) prunes attention modules and repurposes MLPs as MoE experts
  • Guided by hardware Roofline theory to overcome the inference memory wall
  • Selective token routing limits active parameters
  • Aims to improve the Pareto frontier for on-device LLM efficiency
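
As a rough illustration of how selective token routing can limit active parameters, the sketch below uses generic top-k MoE routing in plain Python. It is a minimal example under stated assumptions, not the LF-UC router: the function names, the top-k rule, the logits, and the parameter counts are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalized so they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def active_params(shared_params, params_per_expert, k):
    """Parameters touched for one token: shared weights plus k routed experts."""
    return shared_params + k * params_per_expert


# Hypothetical router scores for one token over 4 experts.
logits = [0.1, 2.0, -1.0, 1.5]
chosen = route_top_k(logits, k=2)

# Hypothetical sizes (arbitrary units): 4 experts stored, 2 used per token.
total = active_params(shared_params=100, params_per_expert=50, k=4)
active = active_params(shared_params=100, params_per_expert=50, k=2)

print("selected experts:", [i for i, _ in chosen])
print("stored vs. active parameters:", total, active)
```

In this toy setup every expert is stored, but each token only reads the weights of the experts it is routed to, so the per-token parameter traffic is smaller than the full model size.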
