First Independent Evaluation of NVIDIA's CuTile on Hopper and Blackwell GPUs
A recent study posted on arXiv presents the first independent evaluation of NVIDIA's CuTile, a Python-based, tile-centric abstraction for writing GPU kernels. The analysis benchmarks CuTile against cuBLAS, Triton, WMMA, and hand-written SIMT kernels on three NVIDIA GPUs: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. Workloads include GEMM, fused multi-head attention, and end-to-end large language model inference in BF16/FP16 precision. The findings indicate that CuTile's efficiency is workload- and architecture-dependent: on the Blackwell B200, a roughly 60-line CuTile fused-attention kernel reached 1007 TFLOP/s, outperforming FlashAttention-2 by 2.5x.
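The summary does not show CuTile's actual API, but the tile-centric idea it abstracts — decomposing a matrix multiply into fixed-size tiles and accumulating partial products per tile, which maps naturally onto GPU tensor cores — can be sketched in plain NumPy. The function name and tile size below are illustrative, not CuTile code:

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Illustrative tile-based GEMM: C = A @ B computed tile by tile.

    Frameworks like CuTile express the kernel at this tile granularity
    and leave thread-level mapping to the compiler/runtime.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    # Each (i, j) output tile accumulates partial products over the k tiles.
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C
```

On a GPU, each output tile would be assigned to a thread block; the NumPy loop nest only demonstrates the decomposition.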
Key facts
- First independent evaluation of NVIDIA's CuTile
- CuTile is a Python-based, tile-centric abstraction for GPU kernel development
- Benchmarked against cuBLAS, Triton, WMMA, and raw SIMT
- Tested on H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition GPUs
- Workloads: GEMM, fused multi-head attention, end-to-end LLM inference
- Precision used: BF16/FP16
- On B200, CuTile achieved up to 1007 TFLOP/s for fused attention
- CuTile outperformed FlashAttention-2 by 2.5x on B200
- CuTile required only 60 lines of Python kernel code for fused attention
- CuTile effectiveness is workload- and architecture-dependent
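Throughput figures like the 1007 TFLOP/s above are conventionally derived by dividing the operation count by the measured kernel time; for a GEMM the standard count is 2·M·N·K (one multiply and one add per term). A minimal helper showing that arithmetic (the function name is my own; attention kernels use a different operation count):

```python
def gemm_tflops(M, N, K, seconds):
    """Achieved throughput in TFLOP/s for an M x K by K x N GEMM.

    Uses the conventional 2*M*N*K operation count: each of the M*N
    outputs sums K products, i.e. K multiplies and K adds.
    """
    return (2 * M * N * K) / seconds / 1e12

# Example: an 8192^3 GEMM in 1 second sustains ~1.1 TFLOP/s.
print(gemm_tflops(8192, 8192, 8192, 1.0))
```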
Entities
Institutions
- NVIDIA
- arXiv