ARTFEED — Contemporary Art Intelligence

ELMoE-3D: Hybrid-Bonding Framework Boosts MoE Inference Speed

ai-technology · 2026-04-25

Researchers propose ELMoE-3D, a hardware-software co-designed framework that uses hybrid bonding (HB) to accelerate Mixture-of-Experts (MoE) model inference in on-premises serving. MoE models now dominate large language models, but decoding is memory-bound: each token activates only a few experts, so per-token compute is sparse, yet the activated expert weights must still be streamed densely from memory. Existing memory-centric architectures such as processing-in-memory (PIM) and near-memory processing (NMP) improve bandwidth but leave compute underutilized at high batch sizes. Speculative decoding (SD) reduces the number of target-model invocations, but expert weights are still loaded for tokens that are later rejected, which limits its benefit for MoE. ELMoE-3D unifies cache-based acceleration with speculative decoding, exploiting two intrinsic elasticity axes of MoE, the number of active experts and the weight bit width, to construct Elastic Self-Speculative Decoding (Elastic-SD). Scaling both axes jointly yields speedups across the full range of batch sizes. The paper is available on arXiv under identifier 2604.14626.
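
The summary describes Elastic-SD only at a high level. The toy Python sketch below illustrates the general idea of self-speculative decoding along MoE's expert and bit axes: the same weights draft cheaply (fewer active experts, lower bit width) and then verify at full fidelity. The shapes, the greedy accept rule, the quantization scheme, and names such as elastic_sd_step are illustrative assumptions, not the paper's actual algorithm or code.

    # Toy sketch of elastic self-speculative decoding over one MoE layer.
    # Hypothetical illustration, not the ELMoE-3D implementation.
    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB, DIM, N_EXPERTS = 50, 16, 8

    # Toy expert weights: one output projection per expert, plus a router.
    experts_fp32 = [rng.standard_normal((DIM, VOCAB)) for _ in range(N_EXPERTS)]
    router = rng.standard_normal((DIM, N_EXPERTS))

    def quantize(w, bits):
        """Crude symmetric fake-quantization emulating the bit-elasticity axis."""
        scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
        return np.round(w / scale) * scale

    experts_int4 = [quantize(w, bits=4) for w in experts_fp32]

    def embed(token):
        """Deterministic toy embedding of a token id."""
        return np.random.default_rng(token).standard_normal(DIM)

    def moe_logits(h, experts, top_k):
        """Route hidden state h to its top_k experts and average their logits."""
        gate = h @ router
        chosen = np.argsort(gate)[-top_k:]
        return np.mean([h @ experts[e] for e in chosen], axis=0)

    def elastic_sd_step(prefix, draft_len=4):
        """Draft with the cheap (elastic) configuration, verify with the full one."""
        # 1) Draft pass: one expert per token (expert axis), 4-bit weights (bit axis).
        draft, h = [], embed(prefix[-1])
        for _ in range(draft_len):
            tok = int(np.argmax(moe_logits(h, experts_int4, top_k=1)))
            draft.append(tok)
            h = embed(tok)
        # 2) Verify pass: full precision, full top-k; accept the matching prefix
        #    and fall back to the target's token at the first mismatch.
        accepted, h = [], embed(prefix[-1])
        for tok in draft:
            target_tok = int(np.argmax(moe_logits(h, experts_fp32, top_k=2)))
            accepted.append(target_tok)
            if target_tok != tok:
                break
            h = embed(tok)
        return accepted

    print(elastic_sd_step([1, 2, 3]))

In a real serving stack the verify pass would score all drafted positions in one batched forward call; the point here is only that draft and target share the same expert weights, so shrinking the number of active experts and the bit width shrinks the memory traffic of the draft pass.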

Key facts

  • ELMoE-3D uses hybrid bonding (HB) for HW-SW co-design.
  • MoE models are memory-bound in on-premises serving (see the back-of-the-envelope sketch after this list).
  • Processing-in-memory (PIM) and near-memory processing (NMP) improve bandwidth but underutilize compute at high batch sizes.
  • Speculative decoding's benefit is limited in MoE because experts are still loaded for rejected tokens.
  • Elastic-SD scales expert and bit elasticity axes jointly.
  • Framework unifies cache-based acceleration and speculative decoding.
  • Paper available on arXiv (2604.14626).
  • Targets overall speedup across batch sizes.
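
To make the memory-bound claim concrete, here is a back-of-the-envelope calculation. The expert count, top-k, and batch size are made-up assumptions for illustration, not numbers reported in the paper.

    # Rough arithmetic-intensity estimate for one MoE layer during decoding.
    # All numbers are illustrative assumptions, not figures from ELMoE-3D.
    E_TOTAL = 64   # experts in the layer
    TOP_K   = 2    # experts each token is routed to
    BATCH   = 32   # tokens decoded per step

    # Expected number of distinct experts touched when BATCH tokens each pick
    # TOP_K experts roughly uniformly at random.
    distinct = E_TOTAL * (1 - (1 - TOP_K / E_TOTAL) ** BATCH)

    # Each touched expert's weights are read once (about 2 bytes per parameter
    # in fp16) and perform about 2 FLOPs per parameter for every token routed
    # to it, so FLOPs per weight byte is roughly "tokens per touched expert".
    intensity = BATCH * TOP_K / distinct

    print(f"distinct experts touched: {distinct:.1f} of {E_TOTAL}")
    print(f"arithmetic intensity: about {intensity:.1f} FLOPs per weight byte")
    # Accelerators need far higher intensities to become compute-bound,
    # which is why MoE decoding is bandwidth-limited at small batches.

Larger batches raise the number of tokens per touched expert and thus the arithmetic intensity, which is consistent with the paper's observation that bandwidth-oriented PIM and NMP designs start to leave compute underutilized at high batch sizes.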

Entities

Institutions

  • arXiv

Sources