ARTFEED — Contemporary Art Intelligence

CATS: Self-Speculative Decoding for Memory-Limited LLM Inference

ai-technology · 2026-05-13

A new study on arXiv (2605.11186) introduces CATS (Cascaded Adaptive Tree Speculation), a framework for accelerating Large Language Model (LLM) inference on memory-constrained devices. Auto-regressive decoding in LLMs is memory-bound: every decoding step must fetch the model weights and intermediate state from memory, such as high-bandwidth memory (HBM) on GPUs. Speculative decoding mitigates this by verifying multiple draft tokens in a single target-model pass, amortizing the cost of each expensive target call. However, existing methods assume there is enough HBM to hold both the target model and a draft model, which is impractical on devices with limited DRAM, such as edge platforms. CATS instead applies a cascaded verification-and-correction scheme that keeps speculation within the device's memory budget.
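To make the draft-then-verify idea concrete, here is a toy sketch of the generic speculative-decoding loop in the greedy case. It is not the CATS algorithm: the "models" below are stand-in deterministic functions, and all names (`target_next`, `draft_next`, `speculative_step`, `k`) are illustrative assumptions, not identifiers from the paper.

```python
# Toy sketch of draft-then-verify speculative decoding (greedy case).
# The "models" are cheap deterministic next-token functions standing in
# for real LLMs; names and logic are illustrative, not from the paper.

def target_next(prefix):
    # stand-in for one expensive target-model decoding step
    return (sum(prefix) * 31 + 7) % 50

def draft_next(prefix):
    # cheap draft model that agrees with the target most of the time
    return target_next(prefix) if sum(prefix) % 5 else 0

def speculative_step(prefix, k=4):
    """Propose k draft tokens, then verify them against the target.

    On real hardware the k verification positions are scored in a single
    batched target pass; the longest agreeing prefix is accepted and the
    first mismatch is replaced by the target's own token, so at least one
    token is emitted per target call.
    """
    draft = list(prefix)
    for _ in range(k):
        draft.append(draft_next(draft))

    accepted = list(prefix)
    for i in range(len(prefix), len(draft)):
        t = target_next(draft[:i])
        if draft[i] == t:
            accepted.append(t)   # draft token accepted
        else:
            accepted.append(t)   # target's correction; stop speculating
            break
    return accepted

out = speculative_step([1, 2, 3], k=4)
```

Because every accepted token matches what greedy target-only decoding would have produced, the output is identical to the target model's, only cheaper per token when the draft is accurate.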

Key facts

  • arXiv paper ID: 2605.11186
  • Published on arXiv
  • Introduces CATS framework
  • CATS stands for Cascaded Adaptive Tree Speculation
  • Addresses memory-bound LLM inference
  • Targets memory-constrained devices like edge platforms
  • Existing speculative decoding assumes large HBM
  • CATS uses cascaded verification and correction

Entities

Institutions

  • arXiv

Sources