Google Researchers Introduce Ragged Paged Attention for Efficient LLM Inference on TPUs
A recent technical paper introduces Ragged Paged Attention (RPA), an attention kernel designed specifically for Tensor Processing Units (TPUs). Developed by a team of Google researchers, RPA addresses the inefficient mapping of Large Language Model (LLM) workloads onto TPU architectures, which are increasingly attractive for cost-effective deployment. Existing inference systems are optimized primarily for GPUs, leaving a gap in TPU-based serving solutions.

RPA combines three main techniques: fine-grained tiling for dynamic slicing over ragged memory, a custom software pipeline that fuses KV cache updates with attention computation, and a distribution-aware compilation strategy that generates specialized kernels. Implemented with Pallas and Mosaic, the design targets both performance and total cost of ownership in modern serving environments with dynamic, ragged execution patterns. The paper (arXiv:2604.15464v1) underscores the ongoing shift toward TPU accelerators for LLM deployment.
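To make the core idea concrete, here is a minimal NumPy sketch of paged attention over a ragged batch: each sequence's KV cache lives in fixed-size pages drawn from a shared pool, a per-sequence page table maps logical positions to physical pages, and sequences may have different true lengths. This is an illustrative toy, not the paper's TPU kernel; all names, shapes, and the page size are assumptions.

```python
import numpy as np

PAGE_SIZE = 4   # tokens per KV page (illustrative choice)
HEAD_DIM = 8    # per-head dimension (illustrative choice)

def paged_attention(q, k_pages, v_pages, page_tables, seq_lens):
    """Single-token decode attention over a ragged, paged KV cache.

    q          : [batch, HEAD_DIM]  one query vector per sequence
    k_pages    : [num_pages, PAGE_SIZE, HEAD_DIM]  shared physical K pool
    v_pages    : [num_pages, PAGE_SIZE, HEAD_DIM]  shared physical V pool
    page_tables: per-sequence lists of physical page indices
    seq_lens   : true token count per sequence (the "ragged" part)
    """
    outs = []
    for b, (table, n) in enumerate(zip(page_tables, seq_lens)):
        # Gather this sequence's pages from the pool, then trim the
        # flattened view to the sequence's true length.
        k = k_pages[table].reshape(-1, HEAD_DIM)[:n]
        v = v_pages[table].reshape(-1, HEAD_DIM)[:n]
        # Standard scaled dot-product attention with a stable softmax.
        scores = q[b] @ k.T / np.sqrt(HEAD_DIM)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        outs.append(w @ v)
    return np.stack(outs)
```

A real TPU kernel would replace the Python loop with fine-grained tiles over the ragged layout; the gather-then-attend structure is what the tiling has to express efficiently.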
Key facts
- Ragged Paged Attention (RPA) is a new attention kernel for TPUs
- It addresses inefficient mapping of LLM workloads onto TPU architectures
- Existing LLM inference kernels are largely GPU-centric
- RPA uses fine-grained tiling for dynamic slicing over ragged memory
- It features a custom software pipeline fusing KV cache updates with attention computation
- A distribution-aware compilation strategy generates specialized kernels
- Implemented using Pallas and Mosaic
- Prioritizes performance and total cost of ownership for TPU deployment
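The second bullet group (the fused KV-update-plus-attention pipeline) can also be sketched in miniature: instead of one pass that writes the new token's K/V into the cache and a second pass that attends, a single routine does both. Again, this is a hypothetical NumPy illustration of the idea, not the paper's pipelined TPU implementation; the function name and layout are assumptions.

```python
import numpy as np

PAGE_SIZE = 4   # tokens per KV page (illustrative choice)
HEAD_DIM = 8    # per-head dimension (illustrative choice)

def append_and_attend(q, new_k, new_v, k_pages, v_pages, table, seq_len):
    """Write the newest token's K/V into its page slot, then attend over
    the full updated sequence in one call -- mimicking the fused
    update-plus-attention pipeline (caller must have allocated the page).
    """
    page = table[seq_len // PAGE_SIZE]  # physical page holding the new slot
    slot = seq_len % PAGE_SIZE          # offset within that page
    k_pages[page, slot] = new_k         # in-place KV cache update
    v_pages[page, slot] = new_v
    n = seq_len + 1
    # Attention over the sequence including the token just written.
    k = k_pages[table].reshape(-1, HEAD_DIM)[:n]
    v = v_pages[table].reshape(-1, HEAD_DIM)[:n]
    s = q @ k.T / np.sqrt(HEAD_DIM)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ v, n
```

Fusing the two steps avoids re-reading the cache between an update kernel and an attention kernel, which is the memory-traffic saving the paper's software pipeline is after.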