ARTFEED — Contemporary Art Intelligence

Post-Training Method Converts Dense LLMs to Sparse MoE Architectures

ai-technology · 2026-04-25

Researchers have introduced an analytical framework that converts the feed-forward networks (FFNs) of large language models (LLMs) into sparse Mixture-of-Experts (MoE) architectures using only a small calibration dataset. By examining neuron activation patterns, the method partitions neurons into consistently active shared experts and conditionally active routed experts, then constructs a router analytically from representative neuron statistics. The converted model can be deployed immediately or given optional lightweight fine-tuning, avoiding retraining on hundreds of billions of tokens. The technique applies both to dense models and, recursively, to existing MoE models to obtain hierarchical sparsity.

Scaling LLMs improves performance but drives up inference costs, largely because FFNs account for the bulk of the compute. MoE architectures curb these costs through sparse activation, yet converting a dense model into an MoE normally requires substantial retraining. This framework closes that gap by enabling fast conversion from minimal data.
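
The workflow can be pictured with a short Python sketch. This is not the authors' code: the activation threshold, the share of shared neurons, the clustering step, and every function name below are illustrative assumptions, used only to make the partition-and-route idea concrete.

    import torch

    def partition_ffn(act_fn, W_in, calib_hidden, share_ratio=0.25, n_experts=8):
        """W_in: [d_model, d_ff] input projection of one FFN layer;
        calib_hidden: [n_tokens, d_model] hidden states from calibration data."""
        # 1) Activation statistics: how often each FFN neuron fires on calibration tokens.
        act = act_fn(calib_hidden @ W_in)                      # [n_tokens, d_ff]
        fire_freq = (act.abs() > 1e-3).float().mean(dim=0)     # per-neuron activation frequency

        # 2) Consistently active neurons form the always-on shared expert.
        n_shared = int(share_ratio * W_in.shape[1])
        shared_idx = fire_freq.topk(n_shared).indices
        routed_mask = torch.ones(W_in.shape[1], dtype=torch.bool)
        routed_mask[shared_idx] = False
        routed_idx = routed_mask.nonzero(as_tuple=True)[0]

        # 3) Conditionally active neurons are grouped into experts by clustering their
        #    activation profiles over calibration tokens (k-means here is a stand-in).
        profiles = torch.nn.functional.normalize(act[:, routed_idx].T, dim=1)  # [n_routed, n_tokens]
        centroids = profiles[torch.randperm(profiles.shape[0])[:n_experts]].clone()
        for _ in range(10):
            assign = (profiles @ centroids.T).argmax(dim=1)    # nearest centroid by cosine
            for e in range(n_experts):
                if (assign == e).any():
                    centroids[e] = profiles[assign == e].mean(dim=0)

        # 4) Analytic router: represent each expert by the mean input weights of its
        #    neurons, so routing scores come from statistics rather than gradient training.
        router = torch.stack([
            W_in[:, routed_idx[assign == e]].mean(dim=1) if (assign == e).any()
            else torch.zeros(W_in.shape[0])
            for e in range(n_experts)
        ])                                                     # [n_experts, d_model]
        return shared_idx, routed_idx, assign, router

In a full conversion the shared neurons would keep their original weights, each routed group would be copied into its own expert MLP, and the resulting model could then be served directly or lightly fine-tuned, as the article describes.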

Key facts

  • Framework restructures FFNs into sparse MoE architectures post-training
  • Uses only a small calibration dataset
  • Analyzes neuron activation patterns to partition neurons into shared and routed experts
  • Router constructed analytically from representative neuron statistics
  • Enables immediate deployment or optional lightweight fine-tuning
  • Applies to dense models and recursively to existing MoE models
  • Avoids retraining on hundreds of billions of tokens
  • Reduces LLM inference costs through sparse activation (see the forward-pass sketch after this list)
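
For the inference side, here is a minimal sketch of how such a converted layer might run, assuming the shared expert is always applied and the analytic router selects only the top-k routed experts per token; the function names and the top-k choice are assumptions, not details from the paper.

    import torch

    def moe_ffn_forward(x, shared_expert, routed_experts, router, top_k=2):
        """x: [n_tokens, d_model]; shared_expert and each routed_experts[e] map
        [*, d_model] -> [*, d_model]; router: [n_experts, d_model] (analytic weights)."""
        out = shared_expert(x)                             # always-active shared neurons
        scores = torch.softmax(x @ router.T, dim=-1)       # [n_tokens, n_experts]
        top_scores, top_idx = scores.topk(top_k, dim=-1)
        for slot in range(top_k):
            for e, expert in enumerate(routed_experts):
                hit = top_idx[:, slot] == e                # tokens routed to expert e
                if hit.any():
                    out[hit] = out[hit] + top_scores[hit, slot].unsqueeze(-1) * expert(x[hit])
        return out                                         # most routed neurons never run

Because only the shared expert and the top-k routed experts are evaluated for each token, per-token FFN compute scales with the active fraction of neurons rather than the full hidden dimension, which is the cost reduction the article points to.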

Entities

Institutions

  • arXiv

Sources