QFlash: Integer-Only FlashAttention for Vision Transformers
A team of researchers has unveiled QFlash, an integer-only FlashAttention design that performs softmax entirely in the integer domain and runs as a single Triton kernel. QFlash addresses three key obstacles to integer-only FlashAttention: scale explosion during tile-wise accumulation, shift-based exponential calculations that map poorly onto GPUs, and quantization-granularity constraints that force uniform scales for integer comparisons. Across seven attention workloads from ViT, DeiT, and Swin models, QFlash achieved speedups of up to 6.73× over I-ViT and up to 8.69× on Swin. It also cut energy usage by 18.8% compared to FP16 FlashAttention while preserving Top-1 accuracy on ViT/DeiT and remaining competitive on Swin.
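To make "softmax entirely in the integer domain" concrete, the sketch below implements the shift-based exponential scheme the paragraph refers to (in the spirit of I-ViT's Shiftmax, the baseline QFlash improves on, not QFlash's actual kernel): exp(x) is rewritten as 2^(x·log₂e), the constant log₂e ≈ 1.4375 is realised with bit shifts, and the power of two is split into an integer shift plus a linearly approximated fraction. The function name, the fixed-point widths, and the Q0.15 output format are illustrative assumptions.

```python
import numpy as np

def shift_softmax(q_row: np.ndarray, scale: float, out_bits: int = 15) -> np.ndarray:
    """Integer-only softmax via a shift-based exponential (Shiftmax-style sketch).

    `q_row` holds quantized logits with dequantization step `scale`,
    i.e. the real logits are x = q_row * scale.
    """
    q = q_row.astype(np.int64)
    q = q - q.max()                          # x - max(x): all codes now <= 0
    # x * log2(e) with log2(e) ~ 1.4375 = 1 + 1/2 - 1/16, shifts only
    qp = q + (q >> 1) - (q >> 4)
    q0 = np.int64(round(1.0 / scale))        # integer code representing 1.0
    z = -qp                                  # z >= 0, so exp(x) = 2**(-z / q0)
    q_int, r = np.divmod(z, q0)              # 2**(-z/q0) = 2**(-q_int) * 2**(-r/q0)
    q_int = np.minimum(q_int, np.int64(62))  # guard the shift width
    q_exp = (q0 - (r >> 1)) >> q_int         # 2**(-r/q0) ~ 1 - r/(2*q0), then shift
    # fixed-point normalisation: result is softmax in Q0.{out_bits}
    return ((q_exp << out_bits) // q_exp.sum()).astype(np.int32)

# quick check: codes = [-768, 0, 512, 1280] at scale 1/256 are logits [-3, 0, 2, 5]
codes = np.array([-768, 0, 512, 1280])
print(shift_softmax(codes, 1 / 256) / 2 ** 15)   # close to softmax([-3, 0, 2, 5])
```

Aside from the final normalisation, everything here is adds and shifts; the article's point is that even this shift-heavy formulation is inefficient on GPU arithmetic units, which is one of the problems QFlash's redesign targets.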
Key facts
- QFlash is an end-to-end integer FlashAttention design.
- It performs softmax entirely in the integer domain.
- It runs as a single Triton kernel.
- Three obstacles addressed: scale explosion during tile-wise accumulation, GPU-inefficient shift-based exponentials, and quantization-granularity constraints (see the sketch after this list).
- Tested on ViT, DeiT, and Swin models.
- Up to 6.73× speedup over I-ViT.
- Up to 8.69× speedup on Swin.
- 18.8% energy reduction vs FP16 FlashAttention.
- No Top-1 accuracy loss on ViT/DeiT.
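The scale-explosion obstacle lives in FlashAttention's tile-wise accumulation: each new tile rescales the running accumulator by exp(m_old − m_new), and in integer arithmetic these repeated rescalings must be folded into the tensor scales. Below is a minimal float sketch of that online-softmax loop for a single query row, with the problematic step marked; the function name, tile size, and shapes are illustrative assumptions, not QFlash's Triton kernel.

```python
import numpy as np

def flash_attention_row(q, K, V, tile=64):
    """One query row of FlashAttention's tile-wise online softmax (float sketch)."""
    m = -np.inf                      # running row max
    l = 0.0                          # running sum of exponentials
    acc = np.zeros(V.shape[1])       # running (unnormalised) output
    for s in range(0, K.shape[0], tile):
        k, v = K[s:s + tile], V[s:s + tile]
        scores = q @ k.T
        m_new = max(m, scores.max())
        alpha = np.exp(m - m_new)    # rescale factor applied to all earlier tiles
        p = np.exp(scores - m_new)   # current tile's exponentials
        l = l * alpha + p.sum()
        acc = acc * alpha + p @ v    # <- tile-wise accumulation with rescaling
        m = m_new
    return acc / l
```

The result matches softmax(q @ K.T) @ V up to floating-point error. Keeping `acc` and `l` in integer form across iterations is exactly where per-tile scales can grow without bound unless the kernel requantizes, and the running-max comparison across tiles is only valid when their codes share a uniform scale: the first and third obstacles from the list above.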