New Triton Kernel Reduces Dispatch Overhead in Pruned Vision Transformers
A technical paper published on arXiv (ID: 2604.15408v1) addresses inefficiencies in executing pruned Vision Transformers (ViTs). Because self-attention cost grows quadratically with sequence length, token pruning methods that discard uninformative patches should in principle cut attention FLOPs quadratically; in practice, wall-clock latency gains fall well short of that because of dispatch overhead. At typical post-pruning sequence lengths of 197 tokens or fewer, the matrix arithmetic itself completes in microseconds, while host-side dispatch consumes 60-90 microseconds.

The researchers developed a lightweight, bidirectional Triton attention kernel with a dispatch floor of roughly 40 microseconds, about 1.5 times lower than that of FlashAttention-2's variable-length (varlen) implementation, and integrated it into a complete pack-attend-unpack pipeline. The resulting system achieves up to 2.24 times higher end-to-end throughput than padded PyTorch SDPA implementations.

The work highlights a bottleneck in which existing variable-length attention APIs, including FlashAttention-2's varlen path and PyTorch's NestedTensor SDPA, fail to translate FLOP reductions into proportional speed gains. The focus is on making the computational savings from pruning visible in practical runtime performance.
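To make the pipeline concrete, here is a minimal PyTorch sketch of the pack-attend-unpack pattern described above. The helper names (pack_tokens, segmented_attention, unpack_tokens) and shapes are illustrative assumptions, not the paper's code; the per-segment SDPA loop merely stands in for a fused variable-length kernel such as the paper's Triton kernel or FlashAttention-2's varlen path, which is where the dispatch savings would actually come from.

```python
import torch
import torch.nn.functional as F

def pack_tokens(x, keep_mask):
    """Gather surviving tokens from a padded batch [B, N, D] into a packed
    tensor [total_kept, D] plus cumulative sequence lengths (cu_seqlens)."""
    seqlens = keep_mask.sum(dim=1)                      # kept tokens per image
    cu_seqlens = torch.zeros(x.shape[0] + 1, dtype=torch.long, device=x.device)
    cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)
    return x[keep_mask], cu_seqlens                     # boolean mask flattens B and N

def unpack_tokens(packed, keep_mask):
    """Scatter packed tokens back into the padded [B, N, D] layout."""
    out = packed.new_zeros(*keep_mask.shape, packed.shape[-1])
    out[keep_mask] = packed
    return out

def segmented_attention(q, k, v, cu_seqlens, num_heads):
    """Reference stand-in for a fused variable-length attention kernel: loops
    over images and runs SDPA on each packed segment. A fused kernel would
    replace this loop with a single launch, which is where the per-call
    dispatch overhead is saved."""
    out = torch.empty_like(q)
    head_dim = q.shape[-1] // num_heads
    bounds = cu_seqlens.tolist()
    for start, end in zip(bounds[:-1], bounds[1:]):
        def heads(t):  # [seq, D] -> [1, heads, seq, head_dim]
            return t[start:end].view(1, end - start, num_heads, head_dim).transpose(1, 2)
        o = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
        out[start:end] = o.transpose(1, 2).reshape(end - start, -1)
    return out

# Example: batch of 4 images, 197 tokens each, random keep masks after pruning.
B, N, D, H = 4, 197, 768, 12
x = torch.randn(B, N, D)
keep_mask = torch.rand(B, N) > 0.5
packed, cu_seqlens = pack_tokens(x, keep_mask)
attn_out = segmented_attention(packed, packed, packed, cu_seqlens, H)
padded_out = unpack_tokens(attn_out, keep_mask)
```

On real inputs, keep_mask would come from the pruning module's token scores; the packed layout avoids padding FLOPs, and the benefit of a fused kernel over the per-image loop shown here is fewer launches per layer.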
Key facts
- Paper arXiv ID: 2604.15408v1
- Addresses dispatch overhead in pruned Vision Transformers
- Host-side dispatch consumes 60-90 microseconds (see the timing sketch after this list)
- New Triton kernel reduces dispatch floor to ~40 microseconds
- Achieves up to 2.24x end-to-end throughput over padded PyTorch SDPA
- Typical post-pruning sequence lengths are <=197 tokens
- Integrated into pack-attend-unpack pipeline
- Compares to FlashAttention-2 varlen and PyTorch NestedTensor SDPA
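As a rough way to see why dispatch dominates at these lengths, the sketch below times padded PyTorch SDPA on a ViT-Base-like shape (12 heads of dimension 64, 197 tokens) with plain wall-clock timing. The function name, shapes, and iteration counts are illustrative assumptions rather than the paper's benchmark harness, and absolute numbers will vary with hardware and PyTorch version.

```python
import time
import torch
import torch.nn.functional as F

def sdpa_us_per_call(batch, heads, seq, head_dim, iters=500):
    """Average wall-clock cost per padded-SDPA call, in microseconds, at a
    short ViT-like sequence length. When each kernel's math finishes in a few
    microseconds, the loop is CPU-bound, so this figure mostly reflects
    host-side dispatch overhead rather than FLOPs."""
    q = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    for _ in range(50):  # warm-up: exclude one-time setup costs
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e6

if torch.cuda.is_available():
    # Hypothetical shapes: batch 8, ViT-Base heads, full 197-token sequence.
    print(f"~{sdpa_us_per_call(8, 12, 197, 64):.1f} us per attention call")
```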
Entities
Institutions
- arXiv