ARTFEED — Contemporary Art Intelligence

ODMA Memory Allocation Strategy Improves LLM Serving on LPDDR-Class Accelerators

ai-technology · 2026-04-22

A new memory management technique called ODMA (on-demand memory allocation) targets inefficiencies in large language model (LLM) serving on accelerators with limited random-access bandwidth. Static pre-allocation must provision for the worst-case generation length, wasting memory on shorter requests. Fine-grained paging, by contrast, relies on High Bandwidth Memory's tolerance for random access, which makes it unsuitable for LPDDR systems, where non-sequential access sharply reduces effective bandwidth. Prior work typically assumed static length distributions and HBM characteristics, leaving fragmentation and bandwidth constraints specific to LPDDR hardware unaddressed. ODMA is designed for random-access-constrained accelerators such as the Cambricon MLU series, and it advances generation-length prediction by tackling distribution drift, among other limitations seen in production workloads.
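To make the contrast concrete, here is a minimal sketch (not from the paper) comparing worst-case static pre-allocation against coarse-grained on-demand growth for a KV cache. The sizes, the chunk-growth policy, and the function names are all illustrative assumptions; the key idea is that large contiguous chunks keep accesses mostly sequential, which suits LPDDR-class memory.

```python
# Illustrative assumptions only: model limits, head counts, and chunk size
# are made up for this sketch and are not ODMA's actual parameters.
MAX_LEN = 4096          # model's maximum sequence length (worst case)
CHUNK = 512             # coarse contiguous chunk; large so reads stay
                        # mostly sequential on LPDDR-class memory
BYTES_PER_TOKEN = 2 * 32 * 128 * 2   # K+V, 32 heads, head dim 128, fp16

def static_reserved(actual_len: int) -> int:
    """Static pre-allocation: always reserve the worst case."""
    return MAX_LEN * BYTES_PER_TOKEN

def on_demand_reserved(actual_len: int) -> int:
    """On-demand: grow in large contiguous chunks as tokens arrive."""
    chunks = -(-actual_len // CHUNK)   # ceiling division
    return chunks * CHUNK * BYTES_PER_TOKEN

# A request that generates only 300 tokens:
print(static_reserved(300) // 1024, "KiB reserved statically")
print(on_demand_reserved(300) // 1024, "KiB reserved on demand")
```

For a 300-token generation, the static scheme reserves the full 4096-token budget while the on-demand scheme reserves a single 512-token chunk, an eightfold difference under these assumed sizes.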

Key facts

  • ODMA is an on-demand memory allocation strategy for LLM serving
  • Designed for accelerators with poor random-access bandwidth like LPDDR systems
  • Addresses limitations of static pre-allocation and fine-grained paging
  • Targets random-access-constrained accelerators including Cambricon MLU series
  • Advances generation-length prediction in production workloads
  • Solves fragmentation and bandwidth constraints in LPDDR hardware
  • Previous methods assumed static distributions and HBM characteristics
  • Research published on arXiv with identifier 2512.09427v5
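The drift point in the list above can be sketched as follows. This is a hypothetical illustration of drift-aware generation-length prediction, not the paper's method: an exponentially decayed histogram of observed output lengths lets the predicted reservation track a workload whose length distribution shifts over time. The class name, bucket size, decay factor, and quantile are all assumptions.

```python
class DriftAwareLengthPredictor:
    """Toy sketch: decayed length histogram for reservation sizing."""

    def __init__(self, bucket: int = 128, decay: float = 0.95):
        self.bucket = bucket
        self.decay = decay                 # older observations fade out
        self.weights: dict[int, float] = {}

    def observe(self, length: int) -> None:
        # Decay old mass, then credit the bucket this length falls in.
        for b in self.weights:
            self.weights[b] *= self.decay
        b = (length // self.bucket + 1) * self.bucket
        self.weights[b] = self.weights.get(b, 0.0) + 1.0

    def predict(self, quantile: float = 0.9) -> int:
        # Smallest length covering `quantile` of recent (decayed) mass.
        total = sum(self.weights.values())
        if total == 0:
            return self.bucket
        acc = 0.0
        for b in sorted(self.weights):
            acc += self.weights[b]
            if acc / total >= quantile:
                return b
        return max(self.weights)

predictor = DriftAwareLengthPredictor()
for n in [200, 250, 220, 900, 950, 940, 930]:   # workload drifts longer
    predictor.observe(n)
print(predictor.predict())   # prediction follows the recent, longer lengths
```

Because older observations are decayed rather than kept forever, the predicted length follows the recent portion of the stream, which is the behavior a drift-robust predictor needs under shifting production traffic.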

Entities

Institutions

  • arXiv
  • Cambricon
