IO-Aware GPU Kernels Achieve Up to 3.9× Speedup for Graph Neural Networks
A recent arXiv preprint (2605.31500) introduces IO-aware GPU kernels for Graph Neural Networks (GNNs), whose performance is often limited by sparse, irregular memory access. The researchers group standard GNN layers into three families: SpMM-based convolutions, reduction-based aggregations, and attention-based layers (GATv2 and Graph Transformer). For each family they build specialized GPU kernels that reduce data movement and improve locality. The study also examines graph reordering and finds that its benefit depends on the kernel mapping: gains are more consistent for neighbor-parallel (gather-dominated) kernels than for feature-parallel ones. The fused attention kernels reach up to a 3.9× speedup for Graph Transformers.
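To make the idea of a fused attention kernel concrete, here is a minimal CPU-side sketch in plain Python. It is not the preprint's kernel. It assumes a CSR graph (`indptr`, `indices`) and scaled dot-product scoring, and the function names are hypothetical. The first function stores one score per edge before normalizing; the second folds scoring, softmax, and aggregation into a single pass over each node's neighbors using a running (online) softmax, so no per-edge array is ever stored.

```python
import math


def attention_materialized(indptr, indices, q, k, v):
    """Baseline: keep one score per edge, then normalize, then aggregate."""
    scale = 1.0 / math.sqrt(len(q[0]))
    num_nodes = len(indptr) - 1
    # Edge-wise intermediate: one entry for every (dst, src) edge.
    scores = []
    for dst in range(num_nodes):
        for e in range(indptr[dst], indptr[dst + 1]):
            src = indices[e]
            scores.append(scale * sum(a * b for a, b in zip(q[dst], k[src])))
    out = []
    for dst in range(num_nodes):
        lo, hi = indptr[dst], indptr[dst + 1]
        acc = [0.0] * len(v[0])
        if hi > lo:
            m = max(scores[lo:hi])  # subtract the max for numerical stability
            weights = [math.exp(s - m) for s in scores[lo:hi]]
            total = sum(weights)
            for e, w in zip(range(lo, hi), weights):
                for j, x in enumerate(v[indices[e]]):
                    acc[j] += w / total * x
        out.append(acc)
    return out


def attention_fused(indptr, indices, q, k, v):
    """Fused: one pass per destination node with an online softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for dst in range(len(indptr) - 1):
        m = -math.inf                # running maximum score
        total = 0.0                  # running sum of exp(score - m)
        acc = [0.0] * len(v[0])      # running weighted sum of value rows
        for e in range(indptr[dst], indptr[dst + 1]):
            src = indices[e]
            s = scale * sum(a * b for a, b in zip(q[dst], k[src]))
            new_m = max(m, s)
            rescale = math.exp(m - new_m)  # 0.0 on the first edge (m is -inf)
            w = math.exp(s - new_m)
            total = total * rescale + w
            acc = [a * rescale + w * x for a, x in zip(acc, v[src])]
            m = new_m
        # Nodes without incoming edges keep an all-zero output row.
        out.append([a / total for a in acc] if total > 0.0 else acc)
    return out
```

Both functions return the same values up to floating-point rounding. On a GPU, the second pattern is what lets a kernel keep scores in registers or shared memory instead of writing them to global memory, though the actual speedup depends on details this sketch does not model.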
Key facts
- arXiv preprint 2605.31500 proposes IO-aware GPU kernels for GNNs.
- GNNs are bottlenecked by sparse, irregular memory access.
- Layers are categorized into SpMM-based, reduction-based, and attention-based families.
- Custom kernels reduce data movement and improve locality.
- Graph reordering helps neighbor-parallel (gather-dominated) kernels more consistently than feature-parallel ones.
- Fused attention kernels achieve up to 3.9× speedup for Graph Transformers.
- Frameworks like DGL and PyTorch Geometric materialize edge-wise intermediates.
- The study takes an IO- and arithmetic-intensity-centric view.
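The arithmetic-intensity view can be illustrated with a back-of-the-envelope estimate for an SpMM-style aggregation Y = A·X. The sketch below is a simplified model, not a formula taken from the preprint. It assumes CSR storage, 4-byte values and indices by default, and no cache reuse, so every nonzero gathers a full feature row. The function name and parameters are hypothetical.

```python
def spmm_arithmetic_intensity(num_rows, nnz, feat_dim,
                              value_bytes=4, index_bytes=4):
    """Estimate FLOPs per byte for Y = A @ X with A stored in CSR.

    Simplifying assumption: no cache reuse, so each nonzero of A
    triggers a full read of one feature row of X.
    """
    # One multiply and one add per nonzero per feature column.
    flops = 2 * nnz * feat_dim
    bytes_moved = (
        (num_rows + 1) * index_bytes          # CSR row pointers
        + nnz * (index_bytes + value_bytes)   # column indices and edge values
        + nnz * feat_dim * value_bytes        # gathered rows of X
        + num_rows * feat_dim * value_bytes   # rows of Y, written once
    )
    return flops / bytes_moved


# Hypothetical graph: 100k nodes, 1M edges, 64 features per node.
intensity = spmm_arithmetic_intensity(100_000, 1_000_000, 64)
```

Under these assumptions the ratio can never exceed 0.5 FLOP per byte with 4-byte values, because each gathered feature element costs 4 bytes and contributes only 2 FLOPs. Ratios that low generally indicate a memory-bound kernel, which fits the summary's emphasis on reducing data movement and improving locality.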
Entities
Institutions
- arXiv
- DGL
- PyTorch Geometric