Sparse Attention Distillation Enables Simpler Sequential Replacements in Transformers
A new paper on arXiv (2605.18865) proposes replacing computationally expensive self-attention layers in pretrained vision transformers with simpler sequential modules through sparse attention distillation. The authors observe that transformer layers exhibit diverse sparsity patterns, which suggests that some layers can be approximated by simpler mappings without performance loss. They introduce a plug-and-play, layer-wise distillation framework that selectively replaces attention with sequential modules in a controlled, group-wise manner. The goal is to reduce inference cost while maintaining model quality. The paper is listed on arXiv as a cross submission.
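To make the idea of layer-wise distillation concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the teacher is a single-head self-attention with identity projections, the student is a hypothetical one-parameter recurrence (an exponential moving average over tokens), and the fit uses a simple grid search where a real setup would presumably use gradient descent. All function names and parameters are illustrative assumptions.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_attention(tokens):
    """Toy frozen teacher: single-head self-attention, identity projections."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Every query is compared with every token: quadratic in sequence length.
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)]
        )
    return outputs


def student_scan(tokens, decay):
    """Hypothetical sequential student: h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    state = [0.0] * len(tokens[0])
    outputs = []
    for x in tokens:
        # One left-to-right pass: linear in sequence length.
        state = [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]
        outputs.append(state)
    return outputs


def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    count = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count


def distill_layer(batches, candidates):
    """Pick the student parameter whose outputs best match the teacher's outputs."""
    targets = [teacher_attention(tokens) for tokens in batches]

    def loss(decay):
        per_batch = [mse(student_scan(t, decay), y) for t, y in zip(batches, targets)]
        return sum(per_batch) / len(per_batch)

    losses = {c: loss(c) for c in candidates}
    best = min(losses, key=losses.get)
    return best, losses


random.seed(0)
# 16 toy sequences of 8 tokens with 4 features each (assumed sizes).
batches = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(8)] for _ in range(16)]
candidates = [i / 10 for i in range(10)]
best_decay, losses = distill_layer(batches, candidates)
print(f"best decay: {best_decay}, distillation loss: {losses[best_decay]:.4f}")
```

The teacher is never updated; only the student's parameter is chosen, and the objective is the mismatch between the two layers' outputs on the same inputs. That is the sense in which the replacement is fitted layer by layer in this sketch.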
Key facts
- Paper arXiv:2605.18865 proposes sparse attention distillation for replacing attention with sequential modules.
- Self-attention's cost grows quadratically with the number of tokens, which makes inference expensive.
- Naive substitution of attention with sequential modules is often lossy at larger scales.
- The method uses a plug-and-play layer-wise distillation framework.
- It targets pretrained vision transformer models.
- Controlled group-wise replacements are performed under a fixed training budget.
- The approach leverages diverse sparsity patterns across transformer layers.
- The paper is listed on arXiv as a cross submission.
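Two of the facts above can be illustrated with a short Python sketch: the quadratic growth of token interactions in self-attention, and the idea of using per-layer sparsity to decide which layers to replace. The summary does not say how the paper measures sparsity or forms its groups, so the threshold rule, the ranking, and every function name below are assumptions made for illustration.

```python
def attention_pair_count(num_tokens):
    """Token-pair interactions in full self-attention: one per (query, key) pair."""
    return num_tokens * num_tokens


def sequential_step_count(num_tokens):
    """Steps taken by a module that visits each token once, in order."""
    return num_tokens


def attention_sparsity(weights, threshold=0.05):
    """Fraction of attention weights below `threshold` (assumed sparsity measure)."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w < threshold) / len(flat)


def pick_replacement_group(layer_sparsity, group_size):
    """Indices of the `group_size` sparsest layers (assumed selection rule)."""
    ranked = sorted(
        range(len(layer_sparsity)), key=lambda i: layer_sparsity[i], reverse=True
    )
    return sorted(ranked[:group_size])


# Example: 197 tokens, e.g. 196 image patches plus one class token.
tokens = 197
print(attention_pair_count(tokens))   # 38809 pairwise interactions
print(sequential_step_count(tokens))  # 197 sequential steps

# A made-up 2x3 attention map: half of its weights fall below the threshold.
toy_map = [[0.98, 0.01, 0.01], [0.5, 0.5, 0.0]]
print(attention_sparsity(toy_map))    # 0.5

# Made-up per-layer sparsity scores; replace the two sparsest layers first.
print(pick_replacement_group([0.2, 0.9, 0.5, 0.8], group_size=2))  # [1, 3]
```

Ranking by sparsity is only one plausible reading of "leveraging diverse sparsity patterns"; the paper may use a different criterion.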
Entities
Institutions
- arXiv