JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search
A group of researchers has introduced JetViT, a family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of leading full-attention vision foundation models while running inference more efficiently on high-resolution images. The core contribution is Post-Training Attention Search, an acceleration framework that converts a pre-trained full-attention ViT into an efficient hybrid-attention model by identifying redundant full-attention blocks and replacing them with linear-attention or window-attention blocks. The framework inherits the MLP and attention weights of the original model and explores the design space in three steps: optimizing the linear-attention block design, finding the best combination of linear-attention and window-attention blocks, and identifying the critical full-attention blocks that must be kept. The research is available on arXiv under ID 2605.26636.
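To see why replacing full-attention blocks pays off mainly at high resolution, a rough operation count helps. The sketch below uses standard textbook approximations for one attention head, not figures from the paper; the function name `attention_costs`, the patch size of 16, the head dimension of 64, and the 14x14-token window are all illustrative assumptions.

```python
def attention_costs(image_size, patch=16, dim=64, window=14):
    """Approximate multiply-accumulate counts for one attention head.

    Constants and projection layers are ignored; only the terms that
    scale with the number of tokens are kept.
    """
    n = (image_size // patch) ** 2   # number of image tokens
    w = window * window              # tokens visible inside one window
    full = 2 * n * n * dim           # QK^T, then attention @ V
    windowed = 2 * n * w * dim       # each token attends to w tokens only
    linear = 2 * n * dim * dim       # K^T V first, then Q @ (K^T V)
    return {"tokens": n, "full": full, "window": windowed, "linear": linear}


for size in (224, 1024):
    c = attention_costs(size)
    print(size, c["tokens"], c["full"] / c["linear"], c["full"] / c["window"])
```

Under these assumptions the cost of full attention relative to linear attention grows as the token count divided by the head dimension, so the gap is small at 224x224 and large at 1024x1024.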
Key facts
- JetViT is a family of hybrid-architecture Vision Transformer models.
- It matches the accuracy of state-of-the-art full-attention vision foundation models.
- It achieves higher inference efficiency on high-resolution images.
- The core approach is Post-Training Attention Search.
- Post-Training Attention Search converts full-attention ViTs to hybrid-attention variants.
- It replaces redundant full-attention blocks with linear or window-attention blocks.
- The framework inherits MLP and attention weights from the base model.
- The search has three key steps: optimizing the linear-attention block design, finding the best combination of linear-attention and window-attention blocks, and identifying the critical full-attention blocks to keep.
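The idea of keeping only the critical full-attention blocks can be illustrated with a toy search. This is a minimal greedy sketch, not the procedure from the paper, whose details are not given here: `search_hybrid`, `toy_evaluate`, the candidate order, and the `max_drop` tolerance are all hypothetical, and the scorer is a stand-in for evaluating a real model on a validation set.

```python
from typing import Callable, List

# Cheaper substitutes, tried in this order (an assumption for illustration).
CANDIDATES = ["linear", "window"]


def search_hybrid(
    num_blocks: int,
    evaluate: Callable[[List[str]], float],
    max_drop: float = 0.005,
) -> List[str]:
    """Greedily replace full-attention blocks while the score stays near baseline."""
    layout = ["full"] * num_blocks
    baseline = evaluate(layout)
    for i in range(num_blocks):
        for kind in CANDIDATES:
            trial = layout.copy()
            trial[i] = kind
            if baseline - evaluate(trial) <= max_drop:
                layout = trial  # the cheaper block is good enough, keep it
                break
        # If no candidate passes, block i stays "full" (treated as critical).
    return layout


def toy_evaluate(layout: List[str]) -> float:
    """Stand-in scorer: blocks 2 and 5 need full attention, block 3 tolerates only window."""
    score = 0.80
    for i, kind in enumerate(layout):
        if i in (2, 5) and kind != "full":
            score -= 0.05
        if i == 3 and kind == "linear":
            score -= 0.05
    return score


print(search_hybrid(6, toy_evaluate))
```

In this toy setup the search keeps full attention only in the two blocks where the scorer penalizes replacement, and falls back to window attention where linear attention alone would hurt the score.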
Entities
Institutions
- arXiv