First Independent Evaluation of NVIDIA's CuTile on Hopper and Blackwell GPUs
A recent study posted on arXiv presents the first independent evaluation of NVIDIA's CuTile, a Python-based, tile-centric abstraction for writing GPU kernels. The analysis benchmarks CuTile against cuBLAS, Triton, WMMA, and hand-written SIMT kernels on three NVIDIA GPUs: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. Workloads include GEMM, fused multi-head attention, and end-to-end large language model inference in BF16/FP16 precision. The findings indicate that CuTile's efficiency is workload- and architecture-dependent: on the Blackwell B200, a roughly 60-line CuTile fused-attention kernel reached 1007 TFLOP/s, outperforming FlashAttention-2 by 2.5x.
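The summary does not show CuTile's actual API, but the tile-centric idea it abstracts — decomposing a matrix multiply into fixed-size tiles and accumulating partial products per tile, which maps naturally onto GPU tensor cores — can be sketched in plain NumPy. The function name and tile size below are illustrative, not CuTile code:

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Illustrative tile-based GEMM: C = A @ B computed tile by tile.

    Frameworks like CuTile express the kernel at this tile granularity
    and leave thread-level mapping to the compiler/runtime.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    # Each (i, j) output tile accumulates partial products over the k tiles.
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C
```

On a GPU, each output tile would be assigned to a thread block; the NumPy loop nest only demonstrates the decomposition.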
Key facts
- First independent evaluation of NVIDIA's CuTile
- CuTile is a Python-based, tile-centric abstraction for GPU kernel development
- Benchmarked against cuBLAS, Triton, WMMA, and raw SIMT
- Tested on H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition GPUs
- Workloads: GEMM, fused multi-head attention, end-to-end LLM inference
- Precision used: BF16/FP16
- On B200, CuTile achieved up to 1007 TFLOP/s for fused attention
- CuTile outperformed FlashAttention-2 by 2.5x on B200
- CuTile required only 60 lines of Python kernel code for fused attention
- CuTile effectiveness is workload- and architecture-dependent
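Throughput figures like the 1007 TFLOP/s above are conventionally derived by dividing the operation count by the measured kernel time; for a GEMM the standard count is 2·M·N·K (one multiply and one add per term). A minimal helper showing that arithmetic (the function name is my own; attention kernels use a different operation count):

```python
def gemm_tflops(M, N, K, seconds):
    """Achieved throughput in TFLOP/s for an M x K by K x N GEMM.

    Uses the conventional 2*M*N*K operation count: each of the M*N
    outputs sums K products, i.e. K multiplies and K adds.
    """
    return (2 * M * N * K) / seconds / 1e12

# Example: an 8192^3 GEMM in 1 second sustains ~1.1 TFLOP/s.
print(gemm_tflops(8192, 8192, 8192, 1.0))
```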
Entities
Institutions
- NVIDIA
- arXiv