ARTFEED — Contemporary Art Intelligence

New Triton Kernel Reduces Dispatch Overhead in Pruned Vision Transformers

ai-technology · 2026-04-20

A technical paper published on arXiv (ID: 2604.15408v1) addresses inefficiencies in executing pruned Vision Transformers (ViTs). Token pruning methods theoretically cut attention FLOPs quadratically with sequence length by eliminating uninformative patches, but actual wall-clock gains are capped by dispatch overhead: at typical post-pruning sequence lengths of 197 tokens or fewer, the matrix arithmetic completes in microseconds while host-side dispatch consumes 60-90 microseconds per call. The researchers developed a lightweight, bidirectional Triton attention kernel with a dispatch floor of approximately 40 microseconds, roughly 1.5 times lower than the floor of FlashAttention-2's variable-length implementation, and integrated it into a complete pack-attend-unpack pipeline. The system achieves up to 2.24 times higher end-to-end throughput than padded PyTorch SDPA implementations. The work highlights a bottleneck in which existing variable-length attention APIs (including FlashAttention-2's varlen interface and PyTorch's NestedTensor SDPA) fail to translate FLOP reductions into proportional speedups; the focus is on making the computational savings from pruning visible in practical runtime performance.
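
To make the pattern concrete, the sketch below illustrates the pack-attend-unpack idea in plain PyTorch. It is a minimal stand-in, not the paper's Triton kernel, and the names (pack_attend_unpack, keep_mask) are assumptions for illustration: each image's surviving tokens are gathered, attention runs over only the packed tokens, and results are scattered back. The per-image Python loop stands in for the fused step and is exactly the kind of repeated host-side dispatch the kernel is designed to avoid.

    import torch
    import torch.nn.functional as F

    def pack_attend_unpack(q, k, v, keep_mask):
        """q, k, v: (B, H, N, D) projections; keep_mask: (B, N) bool of tokens that survive pruning."""
        B, H, N, D = q.shape
        out = torch.zeros_like(q)
        for b in range(B):
            idx = keep_mask[b].nonzero(as_tuple=True)[0]            # pack: indices of kept tokens
            qb, kb, vb = (t[b:b+1, :, idx, :] for t in (q, k, v))   # gather (1, H, n_b, D)
            ob = F.scaled_dot_product_attention(qb, kb, vb)         # attend over kept tokens only
            out[b:b+1, :, idx, :] = ob                              # unpack: scatter back to padded layout
        return out

    # Example: batch of 4 images, 197 tokens each, roughly 40% of tokens pruned.
    q = k = v = torch.randn(4, 12, 197, 64)
    keep_mask = torch.rand(4, 197) > 0.4
    y = pack_attend_unpack(q, k, v, keep_mask)

A fused variable-length kernel replaces the per-image loop with a single launch over all packed tokens, which is where the roughly 40 microsecond dispatch floor reported in the paper comes from.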

Key facts

  • Paper arXiv ID: 2604.15408v1
  • Addresses dispatch overhead in pruned Vision Transformers
  • Host-side dispatch consumes 60-90 microseconds (see the timing sketch after this list)
  • New Triton kernel reduces dispatch floor to ~40 microseconds
  • Achieves up to 2.24x higher end-to-end throughput than padded PyTorch SDPA
  • Typical post-pruning sequence lengths are <=197 tokens
  • Integrated into pack-attend-unpack pipeline
  • Compares to FlashAttention-2 varlen and PyTorch NestedTensor SDPA
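
For readers who want to see this dispatch-bound regime on their own hardware, the timing sketch below is an illustrative micro-benchmark, not a procedure from the paper; the shapes and iteration counts are assumptions. If the per-call wall-clock time barely changes as the sequence length shrinks, the workload is launch-bound rather than FLOP-bound.

    import time
    import torch
    import torch.nn.functional as F

    def us_per_call(seq_len, heads=12, dim=64, iters=500):
        q = k = v = torch.randn(1, heads, seq_len, dim, device="cuda", dtype=torch.float16)
        for _ in range(20):                           # warm-up, excluded from timing
            F.scaled_dot_product_attention(q, k, v)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v)
        torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters * 1e6   # microseconds per call

    if torch.cuda.is_available():
        for n in (197, 128, 64):
            print(f"seq_len={n}: {us_per_call(n):.1f} us/call")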

Entities

Institutions

  • arXiv

Sources