ARTFEED — Contemporary Art Intelligence

First Independent Evaluation of NVIDIA's CuTile on Hopper and Blackwell GPUs

ai-technology · 2026-05-01

A study released on arXiv presents the first independent evaluation of NVIDIA's CuTile, a Python framework for writing GPU kernels around a tile-based programming model. The analysis compares CuTile's performance with established approaches such as cuBLAS, Triton, WMMA, and raw SIMT kernels on three NVIDIA GPUs: the H100 NVL, the B200, and the RTX PRO 6000. Workloads include GEMM, fused multi-head attention, and end-to-end large language model inference in BF16/FP16 precision. The findings indicate that CuTile's efficiency is workload- and architecture-dependent: on the Blackwell B200 it reached 1007 TFLOP/s for fused attention, surpassing FlashAttention-2 by 2.5x with only about 60 lines of Python kernel code.

Key facts

  • First independent evaluation of NVIDIA's CuTile
  • CuTile is a Python-based, tile-centric abstraction for GPU kernel development
  • Benchmarked against cuBLAS, Triton, WMMA, and raw SIMT
  • Tested on H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition GPUs
  • Workloads: GEMM, fused multi-head attention, end-to-end LLM inference
  • Precision used: BF16/FP16
  • On B200, CuTile achieved up to 1007 TFLOP/s for fused attention
  • CuTile outperformed FlashAttention-2 by 2.5x on B200
  • CuTile required only 60 lines of Python kernel code for fused attention
  • CuTile effectiveness is workload- and architecture-dependent
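To make the "tile-centric abstraction" concrete: a tile-based kernel decomposes a large operation, such as GEMM, into small fixed-size blocks that map naturally onto GPU thread blocks and tensor cores. The sketch below illustrates that decomposition in plain NumPy; it is an illustrative analogy only, not CuTile's actual API, and the function name and tile size are assumptions for the example.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Compute A @ B tile by tile, mirroring the tile-centric
    decomposition that frameworks like CuTile automate on the GPU.
    Illustrative CPU sketch only -- not CuTile's actual API."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    # Each (i, j) output tile accumulates partial products over k-tiles --
    # the same loop structure a tile-based GPU kernel assigns to thread blocks.
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):
                acc += A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
            C[i:i + tile, j:j + tile] = acc
    return C
```

On a GPU, each output tile's accumulation runs in parallel and the per-tile multiply is handed to tensor-core instructions; a tile framework's job is to generate that mapping from a high-level description like the loop nest above.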

Entities

Institutions

  • NVIDIA
  • arXiv

Sources