AMMA: Memory-Centric Architecture for Low-Latency Long-Context LLM Attention

ai-technology · 2026-04-30

AMMA is a memory-centric, multi-chiplet architecture designed to cut the latency of long-context attention in large language models (LLMs). Current serving systems place the GPU at the center of the design, a poor match for decode-phase attention, which is bound by memory bandwidth rather than compute. AMMA instead replaces GPU compute dies with HBM-PNM (processing-near-memory) cubes, roughly doubling the memory bandwidth available to attention. As context lengths push toward one million tokens in reasoning and agentic workloads, attention latency has become the primary user-facing bottleneck; the architecture targets production serving setups, including attention-FFN disaggregation and NVIDIA's Rubin GPU-LPU platform.
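
Why decode attention is memory-bound follows from back-of-the-envelope arithmetic: each generated token must stream the entire KV cache once, so per-token attention latency is roughly cache size divided by memory bandwidth, and doubling bandwidth roughly halves it. The Python sketch below illustrates the scaling; the helper names and every parameter (layer count, head geometry, bandwidth figures) are illustrative assumptions, not numbers from the paper.

    def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=8,
                       head_dim=128, bytes_per_elem=2):
        # K and V caches are each context_len x n_kv_heads x head_dim
        # per layer; both must be read once per generated token.
        return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

    def attention_latency_ms(context_len, bandwidth_gb_s):
        # Lower bound: the time to stream the KV cache once per decode step.
        return kv_cache_bytes(context_len) / (bandwidth_gb_s * 1e9) * 1e3

    ctx = 1_000_000  # million-token context, as in the article
    for bw_gb_s in (8_000, 16_000):  # hypothetical baseline vs. ~2x bandwidth
        print(f"{bw_gb_s} GB/s -> {attention_latency_ms(ctx, bw_gb_s):.1f} ms/token")

At these assumed shapes, a million-token context implies roughly 330 GB of KV traffic per generated token, which is why attention latency, not compute, becomes the user-facing bottleneck.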

Key facts

  • AMMA is a multi-chiplet, memory-centric architecture for low-latency long-context attention.
  • Current LLM serving systems place the GPU at the center, which is mismatched with memory-bound decode-phase attention.
  • AMMA replaces GPU compute dies with HBM-PNM cubes.
  • AMMA roughly doubles available memory bandwidth.
  • Context lengths are pushing toward one million tokens in reasoning and agentic workloads.
  • Attention latency is the primary user-facing bottleneck for long contexts.
  • The architecture targets production-level attention-FFN disaggregation and NVIDIA's Rubin GPU-LPU platform (see the roofline sketch after this list).
  • The paper is posted on arXiv (arXiv:2604.26103).
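
The attention-FFN disaggregation target rests on a simple roofline argument: during decode, attention's arithmetic intensity stays near 1 FLOP/byte at any batch size, because each request must read its own KV cache, while FFN intensity grows linearly with batch size, because the weight matrices are shared across the batch. The Python sketch below uses illustrative tensor shapes, not values from the paper.

    def attn_intensity(bytes_per_elem=2):
        # Each cached K/V element is read once (bytes_per_elem bytes)
        # and feeds ~2 FLOPs (multiply + add), regardless of batch size.
        return 2 / bytes_per_elem

    def ffn_intensity(batch, d_model=8192, d_ff=28672, bytes_per_elem=2):
        # The two FFN weight matrices are read once and reused across
        # the whole batch, so intensity scales linearly with batch size.
        flops = 2 * batch * (2 * d_model * d_ff)       # up- and down-projection GEMMs
        bytes_moved = 2 * d_model * d_ff * bytes_per_elem
        return flops / bytes_moved

    print(f"attention: ~{attn_intensity():.0f} FLOP/byte at any batch size")
    for b in (1, 32, 256):
        print(f"FFN, batch {b}: ~{ffn_intensity(b):.0f} FLOP/byte")

This gap is why a memory-centric die suits attention while a compute die suits the FFN, the split that attention-FFN disaggregation makes explicit.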

Entities

Institutions

  • arXiv
  • NVIDIA

Sources

  • arXiv:2604.26103