Google Researchers Introduce Ragged Paged Attention for Efficient LLM Inference on TPUs
A recent technical paper introduces Ragged Paged Attention (RPA), an attention kernel designed specifically for Tensor Processing Units (TPUs). Developed by a team of Google researchers, RPA addresses the inefficient mapping of Large Language Model (LLM) workloads onto TPU architectures, which are increasingly attractive for cost-effective deployment. Existing inference systems are optimized primarily for GPUs, leaving a gap in TPU-based serving solutions.

RPA combines three main techniques: fine-grained tiling for dynamic slicing over ragged memory, a custom software pipeline that fuses KV cache updates with attention computation, and a distribution-aware compilation strategy that generates specialized kernels. Implemented with Pallas and Mosaic, the design targets both performance and total cost of ownership in modern serving environments with dynamic, ragged execution patterns. The paper (arXiv:2604.15464v1) underscores the ongoing shift toward TPU accelerators for LLM deployment.
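To make the core idea concrete, here is a minimal NumPy sketch of paged attention over a ragged batch: each sequence's KV cache lives in fixed-size pages drawn from a shared pool, a per-sequence page table maps logical positions to physical pages, and sequences may have different true lengths. This is an illustrative toy, not the paper's TPU kernel; all names, shapes, and the page size are assumptions.

```python
import numpy as np

PAGE_SIZE = 4   # tokens per KV page (illustrative choice)
HEAD_DIM = 8    # per-head dimension (illustrative choice)

def paged_attention(q, k_pages, v_pages, page_tables, seq_lens):
    """Single-token decode attention over a ragged, paged KV cache.

    q          : [batch, HEAD_DIM]  one query vector per sequence
    k_pages    : [num_pages, PAGE_SIZE, HEAD_DIM]  shared physical K pool
    v_pages    : [num_pages, PAGE_SIZE, HEAD_DIM]  shared physical V pool
    page_tables: per-sequence lists of physical page indices
    seq_lens   : true token count per sequence (the "ragged" part)
    """
    outs = []
    for b, (table, n) in enumerate(zip(page_tables, seq_lens)):
        # Gather this sequence's pages from the pool, then trim the
        # flattened view to the sequence's true length.
        k = k_pages[table].reshape(-1, HEAD_DIM)[:n]
        v = v_pages[table].reshape(-1, HEAD_DIM)[:n]
        # Standard scaled dot-product attention with a stable softmax.
        scores = q[b] @ k.T / np.sqrt(HEAD_DIM)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        outs.append(w @ v)
    return np.stack(outs)
```

A real TPU kernel would replace the Python loop with fine-grained tiles over the ragged layout; the gather-then-attend structure is what the tiling has to express efficiently.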
Key facts
- Ragged Paged Attention (RPA) is a new attention kernel for TPUs
- It addresses inefficient mapping of LLM workloads onto TPU architectures
- Existing LLM inference kernels are largely GPU-centric
- RPA uses fine-grained tiling for dynamic slicing over ragged memory
- It features a custom software pipeline fusing KV cache updates with attention computation
- A distribution-aware compilation strategy generates specialized kernels
- Implemented using Pallas and Mosaic
- Prioritizes performance and total cost of ownership for TPU deployment
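The second bullet group (the fused KV-update-plus-attention pipeline) can also be sketched in miniature: instead of one pass that writes the new token's K/V into the cache and a second pass that attends, a single routine does both. Again, this is a hypothetical NumPy illustration of the idea, not the paper's pipelined TPU implementation; the function name and layout are assumptions.

```python
import numpy as np

PAGE_SIZE = 4   # tokens per KV page (illustrative choice)
HEAD_DIM = 8    # per-head dimension (illustrative choice)

def append_and_attend(q, new_k, new_v, k_pages, v_pages, table, seq_len):
    """Write the newest token's K/V into its page slot, then attend over
    the full updated sequence in one call -- mimicking the fused
    update-plus-attention pipeline (caller must have allocated the page).
    """
    page = table[seq_len // PAGE_SIZE]  # physical page holding the new slot
    slot = seq_len % PAGE_SIZE          # offset within that page
    k_pages[page, slot] = new_k         # in-place KV cache update
    v_pages[page, slot] = new_v
    n = seq_len + 1
    # Attention over the sequence including the token just written.
    k = k_pages[table].reshape(-1, HEAD_DIM)[:n]
    v = v_pages[table].reshape(-1, HEAD_DIM)[:n]
    s = q @ k.T / np.sqrt(HEAD_DIM)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ v, n
```

Fusing the two steps avoids re-reading the cache between an update kernel and an attention kernel, which is the memory-traffic saving the paper's software pipeline is after.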