New Triton Kernel Reduces Dispatch Overhead in Pruned Vision Transformers
A technical paper published on arXiv (ID: 2604.15408v1) addresses inefficiencies in executing pruned Vision Transformers (ViTs). Because self-attention cost grows quadratically with sequence length, token pruning methods that discard uninformative patches should in principle cut attention FLOPs quadratically; in practice, wall-clock latency gains fall well short of that because of dispatch overhead. At typical post-pruning sequence lengths of 197 tokens or fewer, the matrix arithmetic itself completes in microseconds, while host-side dispatch consumes 60-90 microseconds.

The researchers developed a lightweight, bidirectional Triton attention kernel with a dispatch floor of roughly 40 microseconds, about 1.5 times lower than that of FlashAttention-2's variable-length (varlen) implementation, and integrated it into a complete pack-attend-unpack pipeline. The resulting system achieves up to 2.24 times higher end-to-end throughput than padded PyTorch SDPA implementations.

The work highlights a bottleneck in which existing variable-length attention APIs, including FlashAttention-2's varlen path and PyTorch's NestedTensor SDPA, fail to translate FLOP reductions into proportional speed gains. The focus is on making the computational savings from pruning visible in practical runtime performance.
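To make the pipeline concrete, here is a minimal PyTorch sketch of the pack-attend-unpack pattern described above. The helper names (pack_tokens, segmented_attention, unpack_tokens) and shapes are illustrative assumptions, not the paper's code; the per-segment SDPA loop merely stands in for a fused variable-length kernel such as the paper's Triton kernel or FlashAttention-2's varlen path, which is where the dispatch savings would actually come from.

```python
import torch
import torch.nn.functional as F

def pack_tokens(x, keep_mask):
    """Gather surviving tokens from a padded batch [B, N, D] into a packed
    tensor [total_kept, D] plus cumulative sequence lengths (cu_seqlens)."""
    seqlens = keep_mask.sum(dim=1)                      # kept tokens per image
    cu_seqlens = torch.zeros(x.shape[0] + 1, dtype=torch.long, device=x.device)
    cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)
    return x[keep_mask], cu_seqlens                     # boolean mask flattens B and N

def unpack_tokens(packed, keep_mask):
    """Scatter packed tokens back into the padded [B, N, D] layout."""
    out = packed.new_zeros(*keep_mask.shape, packed.shape[-1])
    out[keep_mask] = packed
    return out

def segmented_attention(q, k, v, cu_seqlens, num_heads):
    """Reference stand-in for a fused variable-length attention kernel: loops
    over images and runs SDPA on each packed segment. A fused kernel would
    replace this loop with a single launch, which is where the per-call
    dispatch overhead is saved."""
    out = torch.empty_like(q)
    head_dim = q.shape[-1] // num_heads
    bounds = cu_seqlens.tolist()
    for start, end in zip(bounds[:-1], bounds[1:]):
        def heads(t):  # [seq, D] -> [1, heads, seq, head_dim]
            return t[start:end].view(1, end - start, num_heads, head_dim).transpose(1, 2)
        o = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
        out[start:end] = o.transpose(1, 2).reshape(end - start, -1)
    return out

# Example: batch of 4 images, 197 tokens each, random keep masks after pruning.
B, N, D, H = 4, 197, 768, 12
x = torch.randn(B, N, D)
keep_mask = torch.rand(B, N) > 0.5
packed, cu_seqlens = pack_tokens(x, keep_mask)
attn_out = segmented_attention(packed, packed, packed, cu_seqlens, H)
padded_out = unpack_tokens(attn_out, keep_mask)
```

On real inputs, keep_mask would come from the pruning module's token scores; the packed layout avoids padding FLOPs, and the benefit of a fused kernel over the per-image loop shown here is fewer launches per layer.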
Key facts
- Paper arXiv ID: 2604.15408v1
- Addresses dispatch overhead in pruned Vision Transformers
- Host-side dispatch consumes 60-90 microseconds (see the timing sketch after this list)
- New Triton kernel reduces dispatch floor to ~40 microseconds
- Achieves up to 2.24x end-to-end throughput over padded PyTorch SDPA
- Typical post-pruning sequence lengths are <=197 tokens
- Integrated into pack-attend-unpack pipeline
- Compares to FlashAttention-2 varlen and PyTorch NestedTensor SDPA
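As a rough way to see why dispatch dominates at these lengths, the sketch below times padded PyTorch SDPA on a ViT-Base-like shape (12 heads of dimension 64, 197 tokens) with plain wall-clock timing. The function name, shapes, and iteration counts are illustrative assumptions rather than the paper's benchmark harness, and absolute numbers will vary with hardware and PyTorch version.

```python
import time
import torch
import torch.nn.functional as F

def sdpa_us_per_call(batch, heads, seq, head_dim, iters=500):
    """Average wall-clock cost per padded-SDPA call, in microseconds, at a
    short ViT-like sequence length. When each kernel's math finishes in a few
    microseconds, the loop is CPU-bound, so this figure mostly reflects
    host-side dispatch overhead rather than FLOPs."""
    q = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    for _ in range(50):  # warm-up: exclude one-time setup costs
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e6

if torch.cuda.is_available():
    # Hypothetical shapes: batch 8, ViT-Base heads, full 197-token sequence.
    print(f"~{sdpa_us_per_call(8, 12, 197, 64):.1f} us per attention call")
```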
Entities
Institutions
- arXiv