ARTFEED — Contemporary Art Intelligence

ODMA Memory Allocation Strategy Improves LLM Serving on LPDDR-Class Accelerators

ai-technology · 2026-04-22

A new memory management technique called ODMA (on-demand memory allocation) targets inefficiencies in large language model (LLM) serving on accelerators with limited random-access bandwidth. Static pre-allocation must provision for the worst-case generation length, wasting memory on shorter requests. Fine-grained paging, by contrast, relies on High Bandwidth Memory's tolerance for random access, which makes it unsuitable for LPDDR systems, where non-sequential access sharply reduces effective bandwidth. Prior work typically assumed static length distributions and HBM characteristics, leaving fragmentation and bandwidth constraints specific to LPDDR hardware unaddressed. ODMA is designed for random-access-constrained accelerators such as the Cambricon MLU series, and it advances generation-length prediction by tackling distribution drift, among other limitations seen in production workloads.
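To make the contrast concrete, here is a minimal sketch (not from the paper) comparing worst-case static pre-allocation against coarse-grained on-demand growth for a KV cache. The sizes, the chunk-growth policy, and the function names are all illustrative assumptions; the key idea is that large contiguous chunks keep accesses mostly sequential, which suits LPDDR-class memory.

```python
# Illustrative assumptions only: model limits, head counts, and chunk size
# are made up for this sketch and are not ODMA's actual parameters.
MAX_LEN = 4096          # model's maximum sequence length (worst case)
CHUNK = 512             # coarse contiguous chunk; large so reads stay
                        # mostly sequential on LPDDR-class memory
BYTES_PER_TOKEN = 2 * 32 * 128 * 2   # K+V, 32 heads, head dim 128, fp16

def static_reserved(actual_len: int) -> int:
    """Static pre-allocation: always reserve the worst case."""
    return MAX_LEN * BYTES_PER_TOKEN

def on_demand_reserved(actual_len: int) -> int:
    """On-demand: grow in large contiguous chunks as tokens arrive."""
    chunks = -(-actual_len // CHUNK)   # ceiling division
    return chunks * CHUNK * BYTES_PER_TOKEN

# A request that generates only 300 tokens:
print(static_reserved(300) // 1024, "KiB reserved statically")
print(on_demand_reserved(300) // 1024, "KiB reserved on demand")
```

For a 300-token generation, the static scheme reserves the full 4096-token budget while the on-demand scheme reserves a single 512-token chunk, an eightfold difference under these assumed sizes.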

Key facts

  • ODMA is an on-demand memory allocation strategy for LLM serving
  • Designed for accelerators with poor random-access bandwidth like LPDDR systems
  • Addresses limitations of static pre-allocation and fine-grained paging
  • Targets random-access-constrained accelerators including Cambricon MLU series
  • Advances generation-length prediction in production workloads
  • Solves fragmentation and bandwidth constraints in LPDDR hardware
  • Previous methods assumed static distributions and HBM characteristics
  • Research published on arXiv with identifier 2512.09427v5
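The drift point in the list above can be sketched as follows. This is a hypothetical illustration of drift-aware generation-length prediction, not the paper's method: an exponentially decayed histogram of observed output lengths lets the predicted reservation track a workload whose length distribution shifts over time. The class name, bucket size, decay factor, and quantile are all assumptions.

```python
class DriftAwareLengthPredictor:
    """Toy sketch: decayed length histogram for reservation sizing."""

    def __init__(self, bucket: int = 128, decay: float = 0.95):
        self.bucket = bucket
        self.decay = decay                 # older observations fade out
        self.weights: dict[int, float] = {}

    def observe(self, length: int) -> None:
        # Decay old mass, then credit the bucket this length falls in.
        for b in self.weights:
            self.weights[b] *= self.decay
        b = (length // self.bucket + 1) * self.bucket
        self.weights[b] = self.weights.get(b, 0.0) + 1.0

    def predict(self, quantile: float = 0.9) -> int:
        # Smallest length covering `quantile` of recent (decayed) mass.
        total = sum(self.weights.values())
        if total == 0:
            return self.bucket
        acc = 0.0
        for b in sorted(self.weights):
            acc += self.weights[b]
            if acc / total >= quantile:
                return b
        return max(self.weights)

predictor = DriftAwareLengthPredictor()
for n in [200, 250, 220, 900, 950, 940, 930]:   # workload drifts longer
    predictor.observe(n)
print(predictor.predict())   # prediction follows the recent, longer lengths
```

Because older observations are decayed rather than kept forever, the predicted length follows the recent portion of the stream, which is the behavior a drift-robust predictor needs under shifting production traffic.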

Entities

Institutions

  • arXiv
  • Cambricon
