ARTFEED — Contemporary Art Intelligence

MTServe: Hierarchical Cache System for Generative Recommendation

other · 2026-04-29

MTServe is a hierarchical cache management system designed to reduce inference costs in generative recommendation models. It virtualizes GPU memory by treating host RAM as a scalable backup store, addressing the storage explosion that cross-request KV cache reuse causes with long user histories. The system introduces a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On public and production datasets, MTServe achieves up to a 3.1x speedup while sustaining hit ratios above 98.5%.

Key facts

  • Generative recommendation suffers from high inference costs due to repeated encoding of long user histories.
  • Cross-request KV cache reuse is a key optimization, but it causes storage explosion beyond GPU memory limits.
  • MTServe virtualizes GPU memory using host RAM as a scalable backup store.
  • Optimizations include hybrid storage layout, asynchronous data transfer, and locality-driven replacement.
  • MTServe delivers up to 3.1x speedup on public and production datasets.
  • Hit ratios exceed 98.5%.
  • The system is proposed in a paper on arXiv (2604.22881).
  • The paper is categorized under Computer Science > Machine Learning.
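The hierarchy described above can be sketched as a two-tier cache: a small fast tier standing in for GPU memory, backed by a larger tier standing in for host RAM, with entries demoted rather than dropped on eviction. This is a minimal illustration, not MTServe's implementation; the class name, capacities, and the use of LRU as a stand-in for the paper's locality-driven replacement policy are all assumptions, and the asynchronous transfer pipeline is omitted for brevity.

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Hypothetical sketch: a 'GPU' tier backed by a 'host RAM' tier.

    LRU ordering is used here as a simple proxy for MTServe's
    locality-driven replacement policy (actual policy not shown).
    """

    def __init__(self, gpu_capacity, host_capacity):
        self.gpu = OrderedDict()   # fast tier (simulated GPU memory)
        self.host = OrderedDict()  # backup tier (simulated host RAM)
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity
        self.hits = 0
        self.requests = 0

    def get(self, user_id):
        self.requests += 1
        if user_id in self.gpu:
            self.gpu.move_to_end(user_id)  # refresh recency
            self.hits += 1
            return self.gpu[user_id]
        if user_id in self.host:
            # Promote from the host tier back into the GPU tier.
            kv = self.host.pop(user_id)
            self._put_gpu(user_id, kv)
            self.hits += 1
            return kv
        return None  # miss: caller must re-encode the user history

    def put(self, user_id, kv):
        self._put_gpu(user_id, kv)

    def _put_gpu(self, user_id, kv):
        self.gpu[user_id] = kv
        self.gpu.move_to_end(user_id)
        while len(self.gpu) > self.gpu_capacity:
            # Demote the least-recently-used entry to host RAM
            # instead of discarding it outright.
            victim, victim_kv = self.gpu.popitem(last=False)
            self.host[victim] = victim_kv
            while len(self.host) > self.host_capacity:
                self.host.popitem(last=False)  # evict entirely

    def hit_ratio(self):
        return self.hits / self.requests if self.requests else 0.0
```

A lookup that misses the GPU tier but hits the host tier is still a cache hit, which is how a hierarchical design can keep hit ratios high even when the fast tier alone is far too small for all user histories.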

Entities

Institutions

  • arXiv

Sources