ARTFEED — Contemporary Art Intelligence

CATS: Self-Speculative Decoding for Memory-Limited LLM Inference

ai-technology · 2026-05-13

A new study on arXiv (2605.11186) introduces CATS (Cascaded Adaptive Tree Speculation), a framework for accelerating Large Language Model (LLM) inference on memory-constrained devices. Auto-regressive decoding in LLMs is memory-bound: every decoding step must fetch the model weights and intermediate state from memory, such as high-bandwidth memory (HBM) on GPUs. Speculative decoding mitigates this by verifying multiple draft tokens in a single target-model pass, amortizing the cost of each expensive target call. However, existing methods assume there is enough HBM to hold both the target model and a draft model, which is impractical on devices with limited DRAM, such as edge platforms. CATS instead applies a cascaded verification-and-correction scheme that keeps speculation within the device's memory budget.
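To make the draft-then-verify idea concrete, here is a toy sketch of the generic speculative-decoding loop in the greedy case. It is not the CATS algorithm: the "models" below are stand-in deterministic functions, and all names (`target_next`, `draft_next`, `speculative_step`, `k`) are illustrative assumptions, not identifiers from the paper.

```python
# Toy sketch of draft-then-verify speculative decoding (greedy case).
# The "models" are cheap deterministic next-token functions standing in
# for real LLMs; names and logic are illustrative, not from the paper.

def target_next(prefix):
    # stand-in for one expensive target-model decoding step
    return (sum(prefix) * 31 + 7) % 50

def draft_next(prefix):
    # cheap draft model that agrees with the target most of the time
    return target_next(prefix) if sum(prefix) % 5 else 0

def speculative_step(prefix, k=4):
    """Propose k draft tokens, then verify them against the target.

    On real hardware the k verification positions are scored in a single
    batched target pass; the longest agreeing prefix is accepted and the
    first mismatch is replaced by the target's own token, so at least one
    token is emitted per target call.
    """
    draft = list(prefix)
    for _ in range(k):
        draft.append(draft_next(draft))

    accepted = list(prefix)
    for i in range(len(prefix), len(draft)):
        t = target_next(draft[:i])
        if draft[i] == t:
            accepted.append(t)   # draft token accepted
        else:
            accepted.append(t)   # target's correction; stop speculating
            break
    return accepted

out = speculative_step([1, 2, 3], k=4)
```

Because every accepted token matches what greedy target-only decoding would have produced, the output is identical to the target model's, only cheaper per token when the draft is accurate.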

Key facts

  • arXiv paper ID: 2605.11186
  • Published on arXiv
  • Introduces CATS framework
  • CATS stands for Cascaded Adaptive Tree Speculation
  • Addresses memory-bound LLM inference
  • Targets memory-constrained devices like edge platforms
  • Existing speculative decoding assumes large HBM
  • CATS uses cascaded verification and correction

Entities

Institutions

  • arXiv

Sources